
Social Research Methods

Table of Contents

Econometrics with R
    Introduction
    Probability Theory
    A Review of Statistics using R
    Regression
Lecture 1: Telling stories with and about Data
    Objectives
    Data Stories
        Sex + iPhone Example
        Randomised Control Trials (RCT)
        Xenophobia in UK Example
    Adding a trendline = Modelling the data
    True Model driving the data vs. Estimate
(1) Data Stories Exercise
Lecture 2: R
Zoom 1: R Continued
    Regression
    OLS Algorithm
    Key R Commands for Data
(2) R Exercise
Lecture 3: Visions
    Example: Soho Cholera Outbreak
    What can go wrong?
    Example: Covid Hoaxism
        Scatter Plots
        Time Series
        Bar Charts
        Histograms
        Density
        Maps
        D3
Lecture 4: Testing Times
    Monte Carlo Experiment
    Histogram & Density
    Changing the Curves
        Large vs. Small Sample
        Variation in Epsilon
        Non-normal Epsilon
        The Variance of the Estimator
    P-Value
    Significance Levels
    T-Statistic
Zoom 2: Tests Continued
    More or Less Significant Estimates
    More General Hypothesis Tests
    Significance vs. Bias
    Estimation of Variance of Beta
    Critical Values T-distribution
(4) Testing Exercise
Lecture 5: Multivariate Regression
    Causality vs. All Else Equal
        More than 2 Variables
    Perfect Multi-collinearity
    Accounting for variation: R2
        R2 = 100
        Finding R2
    Imperfect Multi-Collinearity
        Joint Hypothesis Test – F-tests
(5) Bias Exercise
Lecture 6: Econometrics for Dummies
    Dummies as bars
    Sets of dummies
    Testing the validity of a linear model
    R – how to create more categories?
    Dummies as dependant variables
    Non-Linear Relationships
        Square Relationship
        Log-Linear Relationships
    Log-log Model
        Cobb Douglas Example
    Extra: Interactions
        Interactions for linear models/continuous variables
Lecture 7: Instrumental Variables
    2 Stage Least Squares Estimator (2 SLS)
        Schooling Effects Example
    A graphical representation of IV
        3 Key Criteria
    Control Variables
    Weak instrument problem
    Family Size Example
    Multiple Instruments
(7) Instruments Exercise
Lecture 8: Time Series
    What's the challenge of time series data?
    COVID vs. GDP Example
    What if time is not linear?
    Covid Hoaxism Example
    Autoregression
    Internet: Stationarity
    Dickey-Fuller Test
    Using R
    Getting Rid of UNIT ROOTS
Lecture 9: Learning like a Machine
    Beyond accuracy
        Evaluating how often and how is my model wrong
    Confusion Matrix
        In R
    Learning irrelevant details of the dataset: Overfitting
        Preventing overfit: Train – Validation
    Decision Tree
Lecture 10: Loose Ends

Econometrics with R

Introduction
- Why R?
o Reproducible
o Many add-ons available on CRAN (comprehensive R archive network)
o Easy to update and expand
- R
o > is the prompt
▪ Code entered at the prompt will be executed
o Objects are defined with <-
▪ E.g., x <- 10 means x = 10
▪ x is a vector (of length 1)
▪ Vectors can hold multiple numbers, e.g., 1–5, or text, e.g., “Hello”
o Functions
▪ Function name is always followed by ()
▪ E.g., seq() = sequence function
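
A minimal sketch of these basics (nothing here is from the course materials, just standard base R):

x <- 10                                # define an object: a numeric vector of length 1
v <- c(1, 2, 3, 4, 5)                  # a vector holding several numbers
greeting <- "Hello"                    # a character vector
s <- seq(from = 1, to = 10, by = 2)    # the seq() function: 1, 3, 5, 7, 9
s                                      # typing an object's name at the prompt prints it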

Probability Theory
Basic concepts
- Mutually exclusive results of random process = outcomes
o Mutually exclusive = means only 1 of possible outcomes can happen
- Probability = proportion of outcome occurring in long run if experiment is repeated many times
- Set of all possible outcomes of random variable = sample space
- Event = subset of sample space; 1 or more outcomes
- Random variable = numerical summary of random outcomes
o Can be discrete or continuous
▪ Discrete = numbers e.g., 0 and 1
▪ Continuous = continuum of possible values

Dice = random

Bernoulli = variable with 2 distinct possible outcomes


- E.g., single coin toss result

- The probability of exactly k heads in n tosses is P(k) = (n choose k) · p^k · (1 − p)^(n − k)
- Where n = number of coin tosses, e.g., 10, and p = probability of heads, so 0.5
- So k = 5 heads
- Write this as P(k = 5)
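
A quick way to check this number in R (a sketch; dbinom() gives binomial probabilities):

dbinom(5, size = 10, prob = 0.5)          # P(k = 5) for n = 10 tosses with p = 0.5, roughly 0.246
sum(dbinom(0:10, size = 10, prob = 0.5))  # the probabilities of all possible outcomes sum to 1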

Binomial distribution = the probability distribution of the number of successes across the repetitions of an experiment

Expected value of a random variable = average value of outcomes when running for a repeated number of
trials
- For a discrete variable = the weighted average of the possible outcomes
o The weights are the associated probabilities

Random sampling in R
sample(1:6, 3, replace = TRUE)
^ Dice = 1:6; three draws; replace = TRUE means putting the number back after each draw

R actually uses a pseudo-random number generator (not truly random, but close enough)
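
For example (a sketch; set.seed() fixes the pseudo-random sequence so results can be reproduced):

set.seed(123)                              # make the "random" draws reproducible
sample(1:6, 3, replace = TRUE)             # three dice throws, putting the number back each time
mean(sample(1:6, 10000, replace = TRUE))   # the long-run average approaches the expected value 3.5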

What is Variance and Standard Deviation?

Standard deviation of a discrete random variable Y = the typical deviation of the random variable from its mean
- The square root of the variance
Variance of a discrete random variable Y = the expected squared deviation of the random variable from its mean

Sample variance = how observations are dispersed around the sample average
Population variance = dispersion of whole population around average

Probability distribution of continuous random variables?


Takes on continuum of possible values so different to discrete variables
Summarised by probability density function (PDF)
- Cumulative distribution function (CDF) = the probability that the random variable is less than or
equal to a particular value

The normal distribution


- Symmetrical
- Bell-shaped

- The normal density is f(x) = 1 / (σ·√(2π)) · exp(−(x − μ)² / (2σ²))
- Can obtain the density at different positions using:
o dnorm(x = c(…, …))
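
For instance (a sketch using the standard normal with mean 0 and standard deviation 1):

dnorm(x = c(-1, 0, 1))               # density of the standard normal at -1, 0 and 1
pnorm(1.96)                          # CDF: P(X <= 1.96), about 0.975
curve(dnorm(x), from = -4, to = 4)   # plot the bell-shaped density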

The chi-squared distribution


- Testing special hypotheses
- Encountered when dealing with regression models

The student t distribution


- Symmetric and bell-shaped
- Looks similar to the normal when the degrees of freedom are large

The F distribution
- Characterised by two degrees-of-freedom parameters
- Related to the other distributions (e.g., chi-squared and t)

Random Sampling = Objects drawn at random from a population – each object is equally likely to end up in
the sample
- Any function of two random variables = also random

Average of a random sample 🡪 is a random variable itself
- This random variable has probability distribution = sampling distribution

To examine the distribution of univariate numerical data: plot it as a histogram and compare it to some
known or assumed distribution
- Will give frequency histogram

A Review of Statistics using R

Estimator = functions of sample data from unknown population


- Random variables because functions of random data

Estimates = numeric values computed by estimators based on sample data


- Non-random numbers

Unbiasedness = the mean of the sampling distribution of an estimator μ̂Y for the population
mean μY equals μY

Consistency = the uncertainty of the estimator μ̂Y decreases as the number of observations in the sample
grows 🡪 the probability of being close to the true μY approaches 1

Hypothesis = a simple question that can be answered YES or NO


- Deal with the null hypothesis = H0 (this is what we are testing)
- Alternative hypothesis = H1 (holds if the null is rejected)

p-value = the probability, under the null hypothesis, of drawing data and observing a test statistic that is at
least as adverse to the null as the test statistic actually computed using the sample data

Standard error of mean Y = estimator of standard deviation of mean Y

In hypothesis testing, two types of mistakes are possible:


1. The null hypothesis is rejected although it is true (type-I-error)
2. The null hypothesis is not rejected although it is false (type-II-error)

The significance level of the test is the probability to commit a type-I-error we are willing to accept in
advance.
- E.g., using a prespecified significance level of 0.05, we reject the null hypothesis if and only if
the p-value is less than 0.05. The significance level is chosen before the test is conducted.

An equivalent procedure is to reject the null hypothesis if the observed test statistic is, in absolute value
terms, larger than the critical value of the test statistic.
- The critical value is determined by the significance level chosen and defines two disjoint sets of
values which are called acceptance region and rejection region.
- The acceptance region contains all values of the test statistic for which the test does not reject
- While the rejection region contains all the values for which the test does reject

The p-value is the probability that, in repeated sampling under the same conditions a test statistic is
observed that provides just as much evidence against the null hypothesis as the test statistic actually
observed.

The actual probability that the test rejects the true null hypothesis is called the size of the test. In an ideal
setting, the size equals the significance level.

The probability that the test correctly rejects a false null hypothesis is called power.

A 95% confidence interval for μY is a random variable that contains the true μY in 95% of all possible
random samples.

A two-sample t-test can be used to test whether the means of two samples differ.
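
A minimal sketch with made-up data (the group names and numbers are purely illustrative):

set.seed(1)
groupA <- rnorm(50, mean = 10, sd = 2)   # hypothetical sample from group A
groupB <- rnorm(50, mean = 11, sd = 2)   # hypothetical sample from group B
t.test(groupA, groupB)                   # two-sample t-test of the null that the means are equal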

Like variance, covariance and correlation of 2 variables = properties that relate to the (unknown) joint
probability distribution of these variables
- Estimate covariance & correlation w/suitable estimators using a sample
- Sample covariance = estimator for the population covariance of X and Y
- Sample correlation = can be used to estimate population correlation
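
In R (a sketch with simulated data, not the course dataset):

set.seed(2)
X <- rnorm(100)
Y <- 2 * X + rnorm(100)   # Y depends positively on X, plus noise
cov(X, Y)                 # sample covariance: estimates the population covariance
cor(X, Y)                 # sample correlation: estimates the population correlation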

Regression

To build a simple linear regression model, we hypothesise that the relationship between the dependent and
the independent variable is linear; formally:

Yi = β0 + β1·Xi + εi
Lecture 1: Telling stories with and about Data

Objectives
- Learn basic data analysis tools for economics & business
- Learn commands on basic statistical software
- Use data analysis as a decision-making tool

Data Stories
Base decisions on evidence – can be anecdotes/case studies
- These days we have vast amounts of data – BIG DATA
o E.g., consumer activity
o Country data
- Need powerful tools to make sense of this
- Economics + statistics = econometrics

Need to tell a story with data – what is the data telling you? Imagination

Model = description (math or verbal) of relationships you can measure

Sex + iPhone Example


“More sex? Get an iPhone”
- Where did this come from?
- Data
o Dating site = OKCupid
o Data: how many sexual partners by a certain age
o iPhone users had many more
- It’s about causal relationships

- Arrows represent causation
- Think about mechanism – why?
o Is it wealth, taste, style, more iPhone dating apps?
o People with iPhones are people who brag more? Will they exaggerate their sex as well as
their phone?
o iPhone users meet more people? Trendy?
o Android users too busy?
o Having more money? – Drives both sex and having an iPhone

o If there is no TRUE +ve relationship between iPhone and sex, the one we find is a SPURIOUS relationship
o Or, if the +ve relationship we find is really due to another factor, that factor is a CONFOUNDING factor
o Or the relationship found could be bigger than the true causal relationship (e.g., there is a small
relationship between iPhones and sex but the money effect makes it seem stronger, because
money is what actually drives both) = UPWARD BIAS
o Upward bias: our estimated effect is larger than the true one
o There could be other things going on, e.g., working hard 🡪 leads to more money, which leads to
iPhones, BUT hard work also leaves less time, so less sex???
▪ is it negatively correlated?! Not what the data found – but there are lots of things
going on in the data
▪ or it may be the case that iPhone users actually have even more sex than estimated

▪ = Downward bias: our estimated effect is smaller than the true one

How can we get a good estimate?


🡪 An unbiased estimate
🡪 To work this out, would need a dataset where iPhone ownership is not influenced by all these
confounding factors (e.g., money, hard work etc.)

Endogeneity Problem: the above problem is driven by factors that are endogenous to the variables
(money, work etc.)
🡪 Need exogenous variation instead
🡪 Where does that come from?

Randomised Control Trials (RCT)


- These are increasingly common in economics/business research
- OBV used in pharmaceuticals/medicine – give control group
o 50% of people get given drugs and others get given placebos
o = exogenous variation in the area of interest
o Don’t want the assignment of explanatory variable driven by different groups
o Could random people be given iPhones = “treatment group”
▪ See how much sex they had?
▪ But there would be ethical and cost problems

o ^ examples of RCTs in economics and business research
o There are lots of crazy relationships (but they can be explained by a story)
- In order to make true relationships (causal) 🡪 need to do actual quantitative data analysis
- MODELLING
- Use mathematical formulas

Xenophobia in UK Example
Since 2010 – UK vilifying foreign citizens
Claim: Foreigners responsible for crime
🡪 Hard to do an RCT for this kind of example because only ONE UK – cannot have a “control”

Periods of time could be: years, seconds, milliseconds

May look at correlation between % of foreigners and % of crime rates in different UK authorities
- Correlation of 0.4333 🡪 not that high (HIGH +ve correlation = 1; HIGH -ve = -1)

- +ve relationship
- Confounding factors: income, population size etc.

- To calculate correlation 🡪 add trendline

Adding a trendline = Modelling the data

^ We have modelled the foreigner crime impact as linear (= simplest case)

B0 = when there is 0% (e.g., crime rate with 0 foreigners)
B1 = slope; the causal relationship between the two variables (e.g. when foreigners go up 1%, how much
does crime go up?)

Crime = β0 + β1 · Foreigners 🡨 this describes the trendline

(Compare y = mx + c, where c = the y-intercept and m = the gradient)
y-axis = Crime
β0 = the intercept with the y-axis (crime rate at 0% foreigners)
β1 = the gradient/relationship

To truly model the data: need to be more specific

- Every observation is indexed by i
- To describe every individual data point, we add a variable ε (epsilon) for each i
- ε measures the gap between the observation and the trendline
o The Residual or Error Term
o = everything the first term of the model cannot account for

Crime_i = β0 + β1 · Foreigners_i + ε_i

o So the i indexes the dataset – the row of the data table you are dealing with
o The epsilon ε describes the gap between the trendline and the data point (residual/error)
o Residual/Error = the gap between what the model tells you and what the actual data is

True Model driving the data vs. Estimate

(the estimated residuals should have a little hat over the ε: ε̂)

This can be used to uncover the true relationship that is out there
e.g., the true relationship between foreigners and crime

So, the blue line is our guess (our estimate of the true model)
The Green is the true model that is out there.

Why is there a difference between the true model and estimated model????
- So the estimated model is just that – an estimate
o It is estimated based on a sample
o A sample of finite data
▪ If we think about alllll the possible samples that are out there – its infinite (so cannot
get the true value)
o So it is not SYSTEMATICALLY WRONG
o It is not right (so by definition it is wrong), but not systematically wrong, because if we took
another sample of similar data, we would not make the exact same mistake
▪ Would get something that is slightly different
o E.g. random sample average of dice rolling could be 4.8 but true average is 3.5
o May get a slope that is too high or too low – unlikely to get the exact same (green) slope
again
- There could be confounding factors that make us systematically over- or under- estimate the actual
parameters
o Other things that are driving the presence of foreigners in the area or crimes in the area
o Or other things affecting dice throws
o E.g. what other factors could be driving the data (apart from foreigners being criminals)?

If we’re considering confounding factors when thinking about the true (green) model – let’s firstly consider
population
When there are higher population 🡪 higher numbers of foreigners 🡪 so E will be bigger
Whereas smaller city 🡪 less population 🡪 less foreigners 🡪 less crime 🡪 E will be lower

^ the Red line above (from dataset) has upward biased relationship (compared with true causal model)

Confounding factors: let’s secondly consider unemployment

More unemployment 🡪 more crime 🡪 but fewer foreigners 🡪 so E will be POSITIVE


Less unemployment 🡪 less crime 🡪 but more foreigners 🡪 so E will be NEGATIVE

^ so the red line above has downward bias, the slope is too flat, we have underestimated the relationship

So depending on which confounding factor we are considering, we could be over or under estimating the
relationship.

(1) Data Stories Exercise
1.1 “Study Finds Negative Effects of Police-Worn Body Cameras” - Police body worn cameras increase
violence against the police?

RALF: Would typically expect the opposite, i.e., if people are on camera it is more likely that they can be
prosecuted and punished for attacking a police officer 🡪 But story?
- Maybe there are people who like a reputation as misfits - badge of honour in their peer group?
- Now what could be better for those people than being known and seen on camera for attacking a
police officer?
- Or? Perhaps having a camera changes the behaviour of the officers. They now feel more confident -
indeed perhaps overconfident - to engage with people that are more prone to attack officers.

On the other hand:


- the introduction of cameras allowed violence against police officers to be better measured?
- Maybe police officers previously did not report all violence against them as doing so would require
them to substantiate their claims which is harder if you don’t have a video record. With that
available reporting went up but not actual incidence of violence.
- This “data story” would imply an upward bias on the actual impact of cameras on violence.

1.2 “Adding guacamole could boost online daters' popularity” 🡪 Guacamole makes you more successful at
online dating?

Maybe:
- Causal? Avocados are more expensive 🡪 more money 🡪 makes you more popular in online dating
- Confounder? Guacamole is quite trendy 🡪 could be an area of common interest for more people 🡪
more popular at dating

RALF: The headline suggests that there is a positive causal effect from mentioning Guacamole to receiving
responses in online dating. A driver for this could be that Guacamole is a signal for a healthy lifestyle and

health and fitness are qualities that are desired when looking for a romantic partner (The story of the
article) OR:
- Perhaps younger people are more likely to be into guacamole. But younger people can also be
expected to get more dating responses. This would imply an upward bias of the Guacamole->Dating
success effect.
- A taste for guacamole could be more prevalent among people living in cities. People are more likely
to respond to people that are close by. Hence, people in cities will have more people close by and
therefore receive more responses. Again this implies an upward bias.

1.3 Going to the Museum makes you live longer?

People who go to museums like to take care of their mental and physical wellbeing; mental by stimulating
themselves at museums and physically also 🡪 therefore they live longer (less diseases)

People who have more money and more leisure time are more likely to go to museums; these people are
also more likely to spend more money on gym memberships because they have the time and money to do
so, therefore they live longer and healthier 🡪 upward bias.

RALF: It’s conceivable that visiting a museum calms you down, allows time for reflection and leads to insights
that help you to be healthier and less stressed and, as a consequence, makes you live longer.
Equally, museum visits could simply be conflated with other more clearcut factors that make you live
longer. Income and education would be clear candidates.

That said, the study quoted in the article has already accounted for these. But there are other factors that
are not easily controlled for. For instance, having a stressful job with little spare time is probably not good
for both finding time to go to the museum and for your life expectancy. This mechanism would bias the
museum -> life expectancy effect upward (stressful job is negatively correlated with both museum visits
and life expectancy)

1.4 “Secret to Winning a Nobel Prize? Eat More Chocolate” 🡪 Eating chocolate helps you win a Nobel?

Upward Bias: Extra money on food + education:


🡪 + chocolate
🡪 + Nobel prize

Chocolate makes you happier 🡪 + Nobel?

RALF: A New York doctor wanted to investigate the effect of flavanols, compounds found in chocolate but
also tea and red wine, on cognitive ability. He used the number of Nobels as a convenient proxy for his
outcome variable and country-level chocolate consumption data as his explanatory variable and got a
highly significant positive relationship. WHAT ELSE THO?

- It might well be that we have an omitted variable here - wealth - which drives that positive
relationship.
- A richer country (like Switzerland, which has 26 Nobel winners) will have more resources to invest in
research and its affluent citizens might be more likely to be able to treat themselves frequently to
chocolate.
- It might also be that people who study (and are therefore more likely to get a Nobel, of course),
need a sugar fix more often and snack more.

- In both of these cases we’d have an upwards bias as wealth and time spent studying are positively
correlated with both chocolate consumption and the number of Nobel prizes.

1.5 “Sex Makes You Rich? Why We Keep Saying “Correlation Is Not Causation” Even Though It’s
Annoying” 🡪 Sex makes you rich?

Sex 🡪 Endorphins 🡪 Work harder 🡪 Get rich (upward bias)

RALF: It might be the case that having sex triggers a rush of hormones, which then make you more
productive and give you an edge in the workplace.
- But it is perhaps more plausible (and this is in fact what the original study claims) that sex is another
indicator of health, which is our omitted variable. In this case, we would have an upwards bias.

Another possibility is that there could be a measurement error.


- The data comes from a survey and you should always ask yourself whether it might be the case that
respondents could systematically misrepresent certain information.
- Other studies have found that when it comes to sex, men often over-report how much of it they’re
having while women under-report.
- If the sample was not gender-balanced we may get upwards or downwards bias, depending on
which gender prevails.

1.6 “Does marriage make people happy, or do happy people get married?” Marriage makes you happy?

Marriage 🡪 Endorphins + happy hormones 🡪 Happiness

Upward bias: People who had successful parents who had good marriages are more likely to get married
and are also less likely to have issues in the future so more likely to be happy – upward bias because
successful parental relationships have a positive correlation with both marriage and happiness

RALF: Reverse causality is a tricky nut to crack. This study finds evidence that happier singles opt more
frequently for marriage and that benefits of marriage vary widely among couples – not surprisingly an
equal division of labour at home is an important factor driving happiness in marriage.

1.7 “Do Night Lights Cause Myopia?” 🡪 Sleeping with a night light as a kid makes you blind later?

Kids who have night lights are more likely to be anxious and shy 🡪 anxious and shy kids tend to spend a lot
of time on the internet and computer rather than talking to actual humans 🡪 lots of screen time ruins their
eyesight 🡪 leads to blindness

RALF: Short-sightedness is a growing problem globally - culprit: night-lights. The data, at face value, seemed
to have suggested that kids who sleep with night lights develop myopia later in life. According to the
researchers, the story was that the light makes the eye develop abnormally, which then affects focus.

There could be a more obvious explanation, however, which is that myopia is hereditary – myopic parents
are more likely to install night lights in the house, including their children’s bedrooms, but the light itself
has nothing to do with their kids becoming short-sighted. This is, again, a case of upwards bias as myopia in
parents is positively correlated with the presence of night lights in the house and with myopia in their
children.

Lecture 2: R
Why R?
- Free
- Open source; Lots of contributors 🡪 Lots of extensions 🡪 new methods of data analysis
- Integration with other programs

Cons?
- Open source; Lots of contributors 🡪 Many different ways of doing the same thing
- E.g., lots of help functions

R vs. RStudio
- R is the computer doing all the calculations and RStudio is the controls around the engine where
you can see what you are doing
- Need both pieces of software

runif(10) = 10 random numbers (drawn from a uniform distribution between 0 and 1)


- r = random
- unif = uniform
- e.g., throwing dice

Create variables, e.g., v1 <- runif(100) (variable v1 will be 100 random numbers)
Plot variables, e.g., plot(v1, v2)
df <- data.frame(v1, v2) 🡪 the variables will be combined into a table (data frame)

# this is how you add comments – so NOT code

To add data from excel: can import from excel OR read.csv(…)

R markdown = .rmd

summary(‘variable name here’) = to get a summary of a variable

Knitting the document: combines everything together 🡪 will formulate the document with all the code put
together
- html document
- can also publish as a webpage

dplyr package – allows for data manipulation

^ make a new variable ‘crimesPc’ which is equal to crimes11 divided by pop11


The %>% is a ‘pipe’ that comes with the dplyr package (originally from magrittr)
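
A sketch of what that chunk presumably looked like (the toy numbers are made up; crimes11, pop11 and crimesPc are the variable names used in the notes):

library(dplyr)
ff <- data.frame(crimes11 = c(1200, 450, 3000),          # toy stand-in for the course data frame
                 pop11    = c(100000, 40000, 250000))
ff <- ff %>% mutate(crimesPc = crimes11 / pop11)         # new variable: crimes per capita
summary(ff$crimesPc)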

^ library(ggplot2) then can use ggplot command 🡪 much more sophisticated scatter plot
^^ ff is the dataframe (all our variables)
^^^ aes = the basic parameters you are trying to draw (= aesthetic)
^^^^ + geom_point() 🡺 tells ggplot to draw the data as points (as opposed to lines or other shapes)

^ this is the plot, there is a massive outlier at the top though!!

^ make it look nice

NOTE: independent variable = x-axis (the cause); dependent variable = y-axis (the effect)
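
A sketch of the ggplot call described above (assuming ff contains b_migr11 and crimesPc as in the notes; the toy values are made up):

library(ggplot2)
ff <- data.frame(b_migr11 = c(5, 8, 12, 20),         # toy stand-in: share of foreigners in %
                 crimesPc = c(0.9, 1.1, 1.3, 1.6))   # toy stand-in: crimes per capita
ggplot(ff, aes(x = b_migr11, y = crimesPc)) +        # aes() sets the aesthetics: x and y variables
  geom_point() +                                     # draw the data as points
  xlab("Share of foreigners in %") + ylab("Crimes per capita") +
  theme_minimal()                                    # "make it look nice"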

Zoom 1: R Continued
```{r}
# ... R code goes here ...
```
^ this is a chunk of R code inside an R Markdown document

Regression

So, we always want to find the true line rather than the estimate (the trendline)

^ we also want to figure out the slope and intercept of this line then we can understand this model

Interpreting estimation results → Always depends on the units of X & Y

Here: A one percentage point increase in the share of foreigners leads to 0.025 more crimes per capita in
a given year

Note: This is not necessarily a statement of fact as it depends on the precision of the estimate and the
possibility of bias. Rather: it is the implication of our estimate if we took it at face value.

Google Definition: Regression = the statistical process for estimating the relationship between a
dependent variable and an independent variable

^ lm = linear model 🡪 don’t need to put in the whole formula!


^^ you just tell R what is on the Left hand side 🡪 the Y value 🡪 crimesPc
^^^ tell R what is on the Right hand side 🡪 the X variable 🡪 the b_migr11
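
A sketch of the call (assuming ff, crimesPc and b_migr11 as above):

r1 <- lm(crimesPc ~ b_migr11, data = ff)   # left of ~ is Y (crimesPc), right of ~ is X (b_migr11)
summary(r1)                                # shows the intercept (beta0) and slope (beta1) estimates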

The lm command has tried to put a line in the scatter plot

So trendline describes the econometric model of the relationship between the outcome (Y variable) and
the explanatory X variable

summary(r1) is below:

^ we have intercept estimate = 1.09


^ we have b_migr11 estimate = 0.025

So, what does 0.025 actually mean???


- It is the gradient of the line (foreigners vs. crimes scatter plot)
- If we change the foreigner’s variable by one unit (up by 1% point) – then the B1 tells us what will
happen to the number of crimes
o So, crimes will go up 0.025 crimes more per capita

^ r1 is a list of components, one of which is ‘coefficients’ – the parameters that have been
calculated
Beta0 = the intercept
Beta1 = the slope

So now we have defined the model – and fitted it to our data = estimated

OLS Algorithm
R finds the estimates of β0 and β1 by minimising the sum of squared residuals (hence least squares):

minimise Σi (Yi − β̂0 − β̂1·Xi)²

The lil ^ hats = estimates

It is squared because we care about how close we are to the points 🡪 the bigger the residuals are, the further
we are from the points
- But we don’t care whether we are away on the +ve or -ve side
- So SQUARE
- For every observation we square the residual, calculate the sum across all the observations in our
dataset, and see how big it is
- And we would like it to be not too big
o This means that our line will not be too far away from these points

The solution is:

β̂1 = Cov(X, Y) / Var(X)

- ^ so, the β1 estimate turns out to be the covariance between the X and Y variables, divided by the variance of X
- So, if you have 2 variables that are +vely correlated 🡪 +ve covariance 🡪 then you get a +ve slope
- But if -ve covariance 🡪 -ve slope

Can then work out β̂0 = mean of Y − (mean of X times the β̂1 estimate from above):

β̂0 = Mean(Y) − β̂1·Mean(X)

So, OLS allows us to calculate linear regression!

Say if there was no relationship and B1 was 0 then B0 would just be mean Y – because mean X times 0 is 0
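
A sketch checking the OLS formulas by hand against lm(), using simulated data (the true values 1 and 0.5 are arbitrary):

set.seed(3)
X <- rnorm(200)
Y <- 1 + 0.5 * X + rnorm(200)          # true intercept 1, true slope 0.5
b1_hat <- cov(X, Y) / var(X)           # slope: covariance of X and Y over variance of X
b0_hat <- mean(Y) - b1_hat * mean(X)   # intercept: mean(Y) minus slope times mean(X)
c(b0_hat, b1_hat)
coef(lm(Y ~ X))                        # lm() returns the same two numbers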

Residuals = ε
For every observation there is a residual 🡪 stored as a new variable ‘residuals’ in the data frame

^ cor() gives the correlation matrix; select() picks out just the ‘residuals’ and ‘b_migr11’ columns
You can see that the correlation of residuals with residuals = 1 (they are the same numbers)
The correlation between the residuals and b_migr11 is so small 🡪 it is virtually 0
Why is that interesting?
- The residuals are (by construction) unrelated to the X variable
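
A sketch of that check, continuing from the regression above (r1 and ff as before):

library(dplyr)
ff$residuals <- resid(r1)                     # one residual per observation
ff %>% select(residuals, b_migr11) %>% cor()  # correlation matrix: the off-diagonal entry is virtually 0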

OLS = Ordinary Least Squares


- Least squares = we minimise the sum of squared deviations (residuals)
- Ordinary = it is the basic type of least squares estimator

*** Online explanation: “the idea of Simple Linear Regression is finding those parameters α and β for
which the error term is minimized. To be more precise, the model will minimize the squared errors:
indeed, we do not want our positive errors to be compensated by the negative ones, since they are equally
penalizing for our model” ***

OLS on YouTube:
Residuals = εi (the difference between the observed Y in the dataset and the fitted Y on the trendline)
- e.g. a -ve residual is when the dot is below the trendline

((Least squares estimates the unknown values of the parameters B0, B1 in the regression function

OLS should be UNBIASED: the expected difference between β(hat) and β(true) should be 0


OLS should have LEAST VARIANCE
OLS should be LINEAR ESTIMATOR

Yi = B0 + B1Xi + Ei ))

So if we wanna figure out how given data measures up to the trendline – we wanna minimise ALL the
residuals! – so, a combination of all residuals?
- BUT
- This is tricky as some are +ve (above trendline) and some are -ve (below trendline)
- Hence, we add up the SQUARES of all residuals – this stops the -ve and +ve residuals from cancelling each other out
o Also useful because larger residuals are penalised even more heavily

Key R Commands for Data

^ Types of merging or joining data (e.g., inner_join, left_join, right_join and full_join in dplyr)

Defining a function:

Loops:

In any programming software – have some form of loops to REPEAT anything 🡪 avoids writing a command
over and over again
- for {} – anything between those brackets is what is going to run over and over
- so first we have created ‘regions’ which is the list that you want to work with

- %>% unique – prevents repeats
- *** remember ‘plotter’ and ‘inner’ need to already be coded into the memory before the above can be
run ***
- so, need to run ALL CHUNKS OF CODE above first
- then the loop will work
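
A small self-contained sketch of both ideas (the ‘plotter’ function and ‘regions’ list from the course code are not reproduced here, so the names below are hypothetical):

# defining a function
squareIt <- function(x) {
  x^2                                            # the last expression evaluated is returned
}
squareIt(4)

# a for loop repeats the code between the braces for each element of a vector
regions <- c("North", "South", "East", "East")   # hypothetical region names (with a repeat)
for (r in unique(regions)) {                     # unique() prevents repeats
  print(paste("Now working on:", r))             # do something with each region
}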

(2) R Exercise
To load a dataset with R command?
- Auto <- read.csv (“C:/file_path_way/auto.csv”, header = TRUE)
- header=TRUE 🡪 column names will be the headers
- to find pathway directory:
o getwd()

Number of observations? 🡪 nrow(auto)


Number of variables? 🡪 ncol(auto)

Installing packages?
- install.packages(“dplyr”)
- library(dplyr)

We can see for area, there is no max or min area because it is a discrete variable (character string)

Q1= The middle value that falls between the smallest number in data set and the median
Q3= The middle value that falls between the median and the largest number in the data set.

PIPING?
- Piping allows you to send the result from one R command to the next.
- E.g. auto %>% summary()
- ^ is the same is summary(auto)

Max Command?
covid_july %>% filter(deathsOcases==max(deathsOcases))

Group together command?


usdaily=covid %>% group_by(date) %>% summarise(deaths=sum(deaths),cases=sum(cases))
head(usdaily)

US Death Rate over time?

^ Now what do we expect to find? To die from COVID you have to get sick first, and that takes at least a couple
of days. Hence, we would expect death rates to be low at first and then to increase.
- Over time we would hope that doctors get better at treating COVID patients.
- Also, people who know they are at risk will increasingly be shielding, so that the only people who get
sick will be those who are less likely to die.
- Both factors should bring the death rate down after an initial peak.
- This is what we see when looking only at the series from the end of March onwards, with a
peak in mid May.
- However, there is a first peak in early March. What could explain this?
- The most plausible explanation is probably measurement error: early on, cases of infection were
probably not counted properly as there was no systematic way of testing the population. On the
other hand, if somebody was so ill that they died, this would almost certainly be picked up by the
authorities.
- Hence death figures were not undercounted.
- Also, early on the numbers were low.
- So a couple of counts too few for cases could easily swing the ratio.

Lecture 3: Visions
Objective: Be able to use diagrams & visualisations to represent data in R – they are often the best way of showing
data and telling a story.

Example: Soho Cholera Outbreak

^ most of the deaths were concentrated around the pump – a good example of visualisation; it tells the right story

What can go wrong?


- Can deliberately mislead people
- E.g., pretending prices have gone down when they haven’t by showing the diagram that it has gone
down but the number is going up
- E.g., Covid – state of Georgia, US – map seems to show COVID situation hasn’t gotten much worse
because the same areas are the same colour – but what the colour means on each graph has
changed – it means a much higher number of cases later
o Unsure if misleading or incompetence
- E.g., pretending something is a bigger deal than what it is such as a growth graph showing a big
difference between figures but actually the number is really small
o E.g., don’t start the scale at 0 but much higher
- If you use the wrong type of diagram for the data, e.g., one that cannot properly show positive and
negative numbers when comparing males and females
- If you just use flashy diagrams which is harder to interpret

🡪 ggplot is an all-round package that can make many types of visualisations

Example: Covid Hoaxism


🡪 Lots of people think COVID is a hoax – lots of tweets relating to this
🡪 Can collect number of tweets from each US state, no. of hoax tweets, COVID cases, deaths
🡪 Hypothesis: States with more COVID hoax tweets will not be following hygiene and rules so more cases
and deaths
🡪 Have to think about confounding factors as well e.g., access to computers so that people can actually
tweet, more rural states where COVID is not so bad anyways

Scatter Plots

^ basic scatter plot


🡪 can examine in our research if the effect of hoaxism is stronger when population density is higher. For
that we group states into quartiles of population density

^ clearly a relationship between density and deaths


🡪 can overlay multiple plots for different categories (such as quartiles of density for instance)

^ so by adding the density of states, we can see that hoax tweets are more closely linked with deaths in
more densely populated states 🡪 but the relationship is not monotone (slope is flatter in top quartile than
2nd and 3rd quartiles)
🡪 all we needed to add to the simple scatter plot command is the color=dens_quart argument as part of
the plot aesthetic

Time Series
🡪 We might ask if hoaxism has died down as the crisis progressed – using day by day data
- Would be quite long; how to shorten?
- Aggregate across states

🡪 the ‘lubridate’ package 🡪 to tell R you are working with dates

^ put data into DATE format
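
A sketch of the date conversion (the column name ‘date’ matches the notes; the toy values are made up):

library(lubridate)
library(dplyr)
usbyday <- data.frame(date = c("2020-03-01", "2020-03-02"),   # dates stored as text
                      hoaxsh = c(0.01, 0.03))
usbyday <- usbyday %>% mutate(date = ymd(date))   # ymd() tells R these strings are dates
class(usbyday$date)                               # now "Date", so ggplot can draw a proper time axis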

^ make DATE into graph: time series

^ to differentiate months further (because we can’t see March for example)

^ same thing but can see what happened in each month of 2020 and can do the same etc. with like weeks
etc.
usbyday=usbyday %>% mutate(week=round_date( date, unit = "week"))

^ creating new variable: ‘week’


- Can then get the same graph but in weeks

< might want to run regression on this data with the number of deaths
to see if there is a relationship between Hoaxism tweets and deaths – but with time series, can overlay the
death time series alongside!

< both axes now!

So what is the story here?
It is interesting to see that every major wave of deaths was preceded by a flare-up of hoaxism a couple of
weeks earlier
- (e.g. in February we had a flare-up of hoaxism followed by a spike in deaths in April. Then in May
hoaxism was strong again, followed by a death spike in late July. Worryingly, in early September we
are seeing another spike in hoaxism.)
Of course, this has to be taken with caution. A different story could be that the death spikes cause hoaxism
(although the first hoaxism spike could not fully be explained by that). This could be due to a phenomenon
we also sometimes see with religious beliefs:
- if delusional beliefs (e.g. an apocalyptic prophecy not coming true) are challenged by reality (e.g. by a
spike in people actually dying), in some cases the deluded rally even more closely around their
delusional beliefs because the cost of stopping believing has now increased.
- For instance, in the COVID case you now not only have the embarrassment of having believed
something silly, but you might have to accept responsibility for behaviour that killed others, maybe
even loved ones.

Bar Charts
Can see if Hoaxism is more intense on some weekdays?

< Ta da

< tidied up and ‘fill’ gives different bars different colours
🡪 seems hoaxers are particularly active on Sundays
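
A sketch of such a bar chart (usbyday, date and hoaxsh are the names used in the notes; the toy data are made up):

library(ggplot2)
library(dplyr)
usbyday <- data.frame(date = seq(as.Date("2020-03-01"), as.Date("2020-03-14"), by = "day"),
                      hoaxsh = runif(14, 0, 0.05))            # toy daily hoax shares
usbyday %>%
  mutate(weekday = weekdays(date)) %>%                        # name of the day of the week
  ggplot(aes(x = weekday, y = hoaxsh, fill = weekday)) +
  geom_col() +                                                # bar height = (summed) hoax share
  theme_minimal()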

Histograms
🡪 Allows us to look at the distribution of a variable across a sample. E.g., we can look at the daily hoax
shares
ggplot(data=usbyday, aes(hoaxsh*100)) + geom_histogram() + ylab("Number of days") + xlab("Hoaxshare in %")

< What’s the story?


The hoaxshare distribution is quite right-skewed: on most days it is lower than 1%, but there are some
outlier days with sometimes more than 6% of tweets of a hoaxist nature

Density
🡪 Often preferable to a histogram: if you multiply the density by the width of a histogram bin, you get the share
of observations (as opposed to the count) that fall into that bin
ggplot(data=usbyday, aes(hoaxsh*100)) + geom_histogram(aes(y=..density.., fill=..density..)) +
ylab("Density") +
xlab("Hoaxshare in %") + theme_minimal() + geom_density()

Maps
One of best ways to visualise data

^ To convey data (e.g. the share of hoxism) we can use a heat map

D3
D3 is powerful javascript library to make interactive web based figures and visualisations. It’s particularly
cool for visualising networks. The R package networkD3 provides a simple interface to make some of the
functionality available in R.
- E.g. we can create flow diagrams (also known as Sankey Networks)

Lecture 4: Testing Times


Objective: doing statistical tests and understanding uncertainty when doing tests; understand the reliability
of a regression result assuming there is no bias (Dealing with issues where we have solved the endogeneity
problem) or misspecification of the model
(= known unknowns)

How would you decide if a dice is fair?


- Physical characteristics – area/shape of each side
- Throw the dice multiple times and see if each outcome comes up an equal number of times
o But would you ever expect exactly equal counts?
o It will not happen, even with a fair dice
- We are looking for extreme cases e.g., 100 sixes and no other number
- Can never be 100% sure if the dice is fair
- If the outcome is very unlikely under fairness, we can conclude the dice is rigged
So, if we have a model like Crime_i = β0 + β1 · Foreigners_i + ε_i,

where more foreigners = more crime 🡪 is the estimated relationship strong enough to confirm it?
🡪 Could be that there is no relationship at all?
🡪 At some point, it becomes unlikely that there is no relationship e.g., the steeper a slope gets
🡪 Need to determine how certain these estimates are

Monte Carlo Experiment


^ named after the casino – high-stakes games with known probabilities, e.g., roulette – BUT using R instead

< R code making our own data; does not look like
there is a relationship between the 2 variables – which makes sense because it was random data
- And the B1 = 0 so again would expect no relationship

BUT, when we run regression on the above data, we get: -0.81 – huh? How? Isn’t that really big?
- Let’s do the whole thing many times:

- So, this is what we would expect from no relationship
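
A sketch of such a Monte Carlo experiment (the sample size, number of repetitions and noise level are arbitrary; the true slope is set to 0, so any estimated slope is pure sampling noise):

set.seed(4)
n <- 30                                   # observations per simulated dataset
reps <- 1000                              # number of simulated datasets
b1 <- numeric(reps)
for (i in 1:reps) {
  x <- rnorm(n)
  y <- 1 + 0 * x + rnorm(n, sd = 2)       # data generated with true beta1 = 0
  b1[i] <- coef(lm(y ~ x))[2]             # store the estimated slope
}
hist(b1, breaks = 50)                     # centred on 0, but single estimates can be far from it
mean(b1); sd(b1)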

Histogram & Density


🡪 Could also plot a histogram
🡪 Histogram counts all values you have in your data and plots how much each value was found

If the data was continuous – it would make the histogram into bins, like above bins of 0.5
Density 🡪 a re-scaling of the frequency histogram – each bar is multiplied by a number so the interpretation
changes slightly
- The area of each bar then matters: it gives the share of observations in that bin
- So density shows shares (%) rather than counts

^ made the bins much smaller here


- Helps us to see the bell-curve here
- Is a bit unclear – goes up and down a bit but the small bins show us the overall

- The blue line shows the bell shape of THIS data and graph – smoothed line of THIS histogram
- Orange line shows actual formula – normal distribution
- The more times you do it, you see the blue curve get closer to the orange curve

Now we can see how our estimate is distributed, so anything that is in the tails would be the EXTREME or
the UNLIKELY

Changing the Curves

Large vs. Small Sample

^ black line is baseline (the original sample); blue line is what happens with a much smaller dataset
🡪 much flatter curve; so probability of being in the middle has gone down
🡪 probability of being in the tails is much higher; certainty has gone down
🡪 going away from the true value – probability is much higher – is much harder to pick out the TRUE value
🡪 few data points = uncertain true value

Thinking about our model:


- We used one particular standard deviation for the above graph
- What if we change epsilon, i.e. change its standard deviation?
- An epsilon with a higher standard deviation 🡪 more uncertainty coming from the epsilon – it could potentially
be much larger and push the estimate further away from 0/the TRUE value

Variation in Epsilon

^ changing epsilon: the range is much larger; but blue curve is still flat
What does this show?
- So, the X obviously has an effect on Y but the epsilon also (which has nothing to do with X) has an
effect on Y
- If there is much more stuff going on in the epsilon/residual – it will be more uncertain trying to
estimate Beta and may be further away from TRUE

What about: Dispersed X vs. not so dispersed X?

^ Density of x – really weird curve, not even a curve really; x that varies a lot more
^^ Graph 2: shows when there is a bigger difference with the X values – it is much easier to figure out what
the effect is 🡪 the Beta values are much clearer
E.g., the effect of GDP on suicide: if there are only small differences in GDP, you may not clearly see an effect
on suicide, but if there is a really BIG drop in GDP there will be a much clearer, more obvious effect on
suicide
🡪 SO, when the X varies a lot more, the Beta becomes clearer

Non-normal Epsilon

^ here the distribution is much weirder = not NORMAL


When the epsilon is distributed like above ^, it is because instead of epsilon being any random number, it
can only take the values -1, 1 and 10 (as can be seen), so the probability distribution is very different

Central Limit Theorem

^ From those NON-NORMAL EPSILONS, it shows that it doesn’t matter – the blue line is still normally
distributed
- So, no matter what epsilon is, we will end up with a normal distribution
- = central limit theorem
o If you take any random variable, sample it many times and calculate the average each time – you will
end up with something (approximately) normally distributed
o (but the approximation is not always good – as shown on the LEFT, where the sample size is small – just
10 observations)
o Thus, need to make sure the number of observations is big enough

The Variance of the Estimator


There are different kinds of normal distributions e.g., can be centred around different numbers e.g., 0, 1 or
10 etc.
- Can be flat or steep

Can work out the variance using the formula Var(β̂₁) = σ²ε / (n·Var(X))


- σ²ε (sigma epsilon squared) = the variance of epsilon
- Divided by n·Var(X), where n is the sample size and Var(X) is the variance of X
- So there will be a wider, flatter bell-shaped curve when the epsilon variance is high (a lot of other
things apart from X are affecting our outcome; hard to get a very precise estimate, high variance)
- When do we get a more precise estimate? When n is bigger, or when the variance of X is bigger (bigger
changes in X – e.g. GDP)

Recap:

- Regression estimates are (approx.) normally distributed
- We can work out the variance
- Normal distribution is fully characterised by standard error and mean
- To work out the likelihood that a particular value (or a more extreme one) arises, we can work out the area
under the density
- We can define how much risk of being wrong we are willing to accept and then work out a critical
threshold (= significance level)

< P: probability that we have


a value that is even larger or smaller than 0.03 (the estimate for b_migr11), i.e. even further away from 0
than 0.03 already is
- 1.23e-12 is a very SMALL number, so the probability of finding something that much bigger/smaller by chance is tiny – our
coefficient is quite precisely estimated
- The fact that this number is so small helps us to reject the hypothesis that there is NO relationship –
we can conclude there is a POSITIVE RELATIONSHIP

P-Value

< the p-value tells us an area of probability


that we get a value more extreme than the 0.037; it tells us how likely it is to get an estimate that
is even further away from 0 than the estimated value
- In many cases, the p-value is much higher and not so close to 0
- When do you say you should reject it? No clear-cut rule but if it is higher than 50%, it would be silly
to reject
- Need to think about thresholds:
Significance Levels

c is the critical threshold; anything above c (further away from zero) – REJECT


Do we mind being wrong in 10% of cases? 5% of cases? Could people die? What are the stakes?
🡪 A common level in literature = 1%, 5% or 10%

< How do you find these values???


🡪 pnorm() = the cumulative normal distribution = gives you the area under the normal distribution from -infinity to a
particular value
🡪 but here we don't want to find the area under the curve – we already know the probability; we want to know the
value that goes along with it
🡪 so, we need qnorm(0.025) 🡨 0.025 because for a 5% significance level there is 2.5% in each tail

< That’s how we got the values on the graph


- The higher the significance level, the smaller the threshold
- Higher significance level means we are less worried about an error of type 1
- Hence we are happy to reject in more cases
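As an illustration, the two-sided critical values quoted below can be computed directly with qnorm():

qnorm(0.05)    # -1.64 -> 10% significance level (5% in each tail)
qnorm(0.025)   # -1.96 ->  5% significance level
qnorm(0.005)   # -2.58 ->  1% significance level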

T-Statistic

< dividing the estimate of Beta by its standard error


So even if β̂ itself is not standard normally distributed, if you divide it by its standard error to get t – then you can
use the qnorm function and find the critical values
- E.g., now we know the critical value for 5% is 1.96; for 10% it is 1.64; for 1% it is 2.58

The t statistic measures how far away the estimate is from its hypothesised mean (in standard errors)
You test the t statistic against the t distribution

^ So even though the data may not be standard normally distributed, by calculating the t-value, we can
compare it to a standard normal critical value
🡪 7.4>1.96 so we can reject the hypothesis

Zoom 2: Tests Continued


So, why do we have tests?
🡪 Because even with very precise estimates, we can rarely find a true model so need to have an idea of how
far we can be from the true model
🡪 And to know when we can say that the coefficient on migration has no effect on the outcome variable – crime
🡪 When is it okay to say no relationship? 0.037? or is it too small?
🡪 In order to say if it is a big or small coefficient – we need to know the likelihood of making a mistake when
rejecting the hypothesis – can do this by figuring out the THRESHOLD we will use
- This is contingent on distribution (normal distribution)
- But can convert anything to normal distribution by dividing coefficient by standard error
o = t-value
- Can decide using p-values
🡪 Usually, we want to reject the hypothesis; if we cannot reject the hypothesis then the variable is not
explaining anything
🡪 If we have excluded endogeneity, then it is not a problem if the hypothesis cannot be rejected – the answer could simply be
that there is no relationship (a negative result)

More or Less Significant Estimates
• If we have a lower significance level (e.g., 1%) we are less likely to reject a hypothesis
• It is always harder to reject at a lower level because we don't want to risk so much to be
wrong
• This is to avoid making a Type I error
• If we still reject β=0 on the basis of an estimate β̂ we say that the estimate is highly significant
• If we would only reject the hypothesis with a much higher significance level (e.g. 10% instead of 5%)
we say that the estimate is only weakly significant
• Have to be cautious – might be wrong

Another Example

< Job Experience vs. Earnings


and want to see if experience has a relationship with job earnings – run regression
- EXP = years of being in job
- EARN = hourly earnings in USD
- So, one more year of experience 🡪 0.24 increase in the dependent variable (EARN) 🡪 24 cents more
- Is this statistically significant though?
- At 1%: 1.73 < 2.58
- At 5%: 1.73 < 1.96
- At 10%: 1.73 > 1.64
- So, we would reject it at 10% but would fail to reject it at 5%
- The p-value = how likely it is that we get results that are more extreme = 8%
- 8% > 5% so we wouldn't reject it at 5%
- But 8% < 10% so we would reject it at 10%
o SAME answer as with the t-values
- *** shows significance (see ‘Signif. Codes:’)
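A quick sketch of the p-value calculation implied here, using the normal approximation (with the exact degrees of freedom you would use pt() instead of pnorm()):

t_value <- 1.73
2 * (1 - pnorm(t_value))   # ~0.084, i.e. roughly the 8% quoted above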

Easy & Hard Estimation

< true data is green line
- Simulated epsilon draws BLUE: quite linear – very similar to the true green line; allowing for very
little spread in epsilon (so very little spread in the beta estimates)
- Epsilon RED: can come up with almost anything – very large spread in epsilon so a very large spread in
the beta estimates

More General Hypothesis Tests

Testing Beta = 0 is the most common test but there could be other areas of interest

Here the hypothesised Beta is -1 (the hypothesis being that an extra year of school trades off against
a year of experience)
- Any test can be expressed with thresholds – for the t-value we can simply use the formula, but for
the p-value we need a command
- library("car")
- linearHypothesis(mod_earn_exp, c("S = -1"))
- ^ S is the coefficient associated with the S variable (school)

-
- P-value = very small and different to the p-value in the original outputs ^^^

Significance vs. Bias


- They are separate
o An estimate can be significant and biased
o Or non-significant and non-biased (or vice versa)
- We don’t prefer one over another because one is significant
- Need to ask underlying reasons why one estimate is significant and other is not
- It is clearly better to have NO BIAS because even if it is significant – it may not be correct
o The whole reason why something may be significant may be due to the bias

Estimation of Variance of Beta

- The variance of X is easily observed


- But how do we get the variance of epsilon?
o We need to estimate it from the residuals, epsilon hat (e.g. σ̂²ε = Σ ε̂ᵢ² / (n − 2) in the simple model)

- The differences between these ^ t-distributions and the normal: the middle is a bit lower, but the tails are a
bit higher
- So, the likelihood of getting something far away from the mean/true value is higher than with the
normal distribution
o This is because you have to use the standard error of BETA in the t-value calculation
o The standard error of BETA is itself an estimate
o So there is greater UNCERTAINTY in your estimate
- There are also multiple t-distributions. Why?
- Depends on degrees of freedom = how many observations – no. of parameters needed to estimate
- So, in this case: observations – 2
- But sometimes have more parameters and more complex models so need to subtract more from
the degrees of freedom
- ^^^ The higher the degrees of freedom, the closer the distribution is to the normal
o So, having a really high number of observations means you don’t have to worry too much

Critical Values T-distribution

So calculating the critical value ^^^, need the degrees of freedom – here we have used 10 (maybe we had
12 observations and 2 parameters to estimate, beta 0 and beta 1)
- The number is then -2.23
- This number is quite different from 1.96
- The more degrees of freedom you have, the smaller the difference between the 2 numbers gets

- e.g., when we did 1000, it was virtually the same number (1.96)
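To illustrate the convergence, the same comparison can be done directly in R:

qt(0.025, df = 10)     # -2.23
qt(0.025, df = 1000)   # -1.96, virtually the normal value
qnorm(0.025)           # -1.96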

To work out critical values of the t distribution we use:


qt(0.025, 58)

this is for a 95% confidence interval with 58 degrees of freedom


degrees of freedom = observations − number of parameters to be estimated.

(4) Testing Exercise

QS: >
The t-value (rounded up to 3 decimal places) is?
Can she reject H0 at at the 5% significance level?
Can she reject H0 at at the 1% significance level?

ANSWERS
-2.857
Yes
Yes

NOTE: I used the rejection threshold values for the normal distribution (these are 1.959964 and 2.5758293 for the 5%
and 1% significance levels). For the t distribution the corresponding values are 2.0017175 and 2.663287 (note that
we have 58 degrees of freedom)

We can also work out p-values - Remember that the p-value is the probability - assuming the H0 is correct -
to have a value more extreme (further away from 0) than the one estimated.
- We can use the pt() command for that, which gives us the cumulative density function of the
t-distribution; e.g. pt(0,58) gives us the probability to have a value smaller than 0.
- Because the t-distribution is symmetric that will always be equal to 0.5
- Note that we have 58 degrees of freedom here as we have 60 observations and we need to estimate 2
parameters (β0 and β1).
- To work out the p-value call the pt() function with the t-value for a given estimate; e.g. for part a)
- pt(-.2/0.07,58)
- [1] 0.002962872
- i.e. the probability of having an estimate lower than -.2 is 0.0029629.
- Note, because our test considers the possibility of being too low and too high (and because the
distribution is symmetric), we need to double this to get the actual p-value, which becomes
0.0059257.
- However, this is still below 1% so we can safely reject the hypothesis that the true parameter is 0.

Lecture 5: Multivariate Regression


Taking back control of confounding factors (that cause endogeneity) 🡪 CONTROL VARIABLES

Objectives:
- Multivariate regression – further control variables in regression
- Understand that more variables is not always better
- Learn hypothesis tests on several parameters

🡨 Wage regression:
- Do you expect a higher wage after studying for
longer?
- But here we capture GENDER too
o We cannot directly use qualitative things as numbers,
e.g. gender, which school someone attended
o But we can create a dummy variable and set it equal to 1 (e.g. for women) and 0 otherwise

3. The precision of any estimate that we are


calculating depends on a number of things, including the
variance of the residual (u or epsilon) – the variance of beta
hat depends on the variance of the residual; by including
β3·FEMALE we take it out of the residual (u), so the residual
becomes smaller – taking away some of the variance
of the residual gives a more PRECISE estimate of BETA

🡨 To get a bias, we need some correlation


between the external confounding factor and
the variable
- Need a story for why it is both driving the dependent variable and is related to the explanatory
variable
- E.g., the MeToo movement – shows why gender affects wages
- But we also need gender to be related to education
- So there is a correlation between the residual (which contains gender) and the explanatory variable (education)
- How does this affect estimate?
o Will cause a bias
o Upward bias
▪ Normally if there is no relationship between education and Epsilon – get NO bias

▪ If there is +ve correlation, upward bias

▪ If -ve correlation, downward bias

▪ But doesn’t just depend on +ve or -ve correlation – but also whether what is missing
has a +ve or -ve effect on DEPENDANT variable
▪ Women = -ve relationship with education (so downward bias right?)

▪ BUT ALSO women = -ve relationship with wage

▪ So double negative = +ve 🡪 spurious positive relationship

▪ So upward bias

▪ Also – the more education you get in your dataset; the less likely you are to find
females in your dataset – so again will be driving up wages
- So how do you get rid of this bias?

-
Red = regression line (too steep)
Green = true line; much flatter (so
upward bias)

🡪 Can rationalise with epsilons because


when years of education = low,
epsilons = low as well (-ve)
🡪 More education years = bigger +ve
epsilons
🡪 could be coming from gender
🡪 square = women; diamond = men
🡪 you can see women with the same
years of education has lower wages

(but also slightly to the left; highest educated person is man)
🡪 can also see the dashed lines, which are the extended model – not necessarily the true model, but there are two groups and
one group has a higher line and the other has a lower line – the intercepts are different but the SLOPEs
are the SAME (the female line is just slightly lower)
🡪 Can read the coefficient on gender as the DIFFERENCE in INTERCEPTS = B3

Causality vs. All Else Equal


Another variable that is likely correlated with schooling and affecting wages is experience (EXPER) –
potentially strong correlation so want to exclude it as control variable
🡪 𝑊𝑎𝑔𝑒 = β1 + β2𝐸𝐷𝑈𝐶 + β3𝐸𝑋𝑃𝐸𝑅 + 𝑢
- If we run a regression of this equation (and EDUC and EXPER are independent of u) the estimate of
β2 gives us the change Wage for one year more of schooling keeping experience (and everything
else constant)
- However, it might not give us the causal effect of increasing schooling on wages.

🡨 by including experience as a control variable – you are


shutting down that part of the causal relationship – this is interesting because if you are trying to make
some changes, you might want to know about the negative effects as well
- We don’t just want to know the artificial effect – we want the physical reality that if you spend time
on education, you can’t spend that time on experience – so you might actually want to INCLUDE IT
- The reason why EDUC and EXPER are correlated is likely because of a chain of causality from
schooling to experience (i.e. if you go to school longer you don’t have so much time to get job
experience; note also that S is typically determined before EXPER, which supports the suggested chain)
- If you include EXPER as a separate explanatory variable then your coefficient on EDUC will not reflect
this causal channel. This is good if you really want the all-else-equal effect of EDUC. However, if you
want the full causal effect of EDUC (e.g. you want to advise the government what an extra year of
schooling does to wages) you get the wrong answer as you are pretending that you can have extra
schooling without reducing people’s experience. So it would be better to exclude EXPER.
- DON’T MAKE IT A CONTROL VARIABLE

🡨 So here gender is driving both education and wages, rather


than education driving gender (unlike the experience case, where education drives experience)
- Gender is mostly (but not exclusively) determined before schooling
- Hence the reason why EDUC and Female variable are correlated because of a causality chain from
Female to EDUC.
- In this case it is vital to include the Female variable to get the correct causal estimate of a change in
EDUC
- In this case we MAKE IT A CONTROL VARIABLE

🡨 not always clear cut; so here one
variable is causing experience and therefore causing wages but there might be other factors going on too
e.g., maybe what is driving how much education you have is actually how much experience you’ve gotten
(motivating you to get more education?)
- Then how do you know whether to make it a control or not???
- Say we are only interested in the effect of education but are worried about the confounding factor
experience – we can run a regression with and without experience and see if education co-efficient
is more or less the same? Then know that it won’t make a difference
- If it does change a lot, then we know we cannot make strong conclusions with the current data and
need more research or other strategies e.g., instrumental variables or if you find other data about
schooling and vocational training
- If the causality between the two explanatory variables goes both ways we are in trouble as far as
finding the causal effect of EDUC is concerned (we are cool for finding the ceteris paribus effect).
Both including or dropping the gender variable will lead to a biased estimate. We have to use other
methods some of which we shall discuss later in the module. INSTRUMENTAL VARIABLES IF THERE
IS REVERSE CAUSALITY.

NOTE: so key is – you can be TOO CONTROLLING!


• More control variables are not always better to identify a causal effect
• To include or not include → depends on direction of causation between control and x var of interest
• Sometimes there is no clear-cut answer as causation goes both ways

^ so we find that coefficient = 0.54


- So, 1 year of education = 54 cents more hourly wages
- Significant because small p-value and high t-value

^ so one more year of education = 1.4 years LESS experience

So – what is the effect of including EXPER on the EDUC coefficient?

🡨 So, education + wages = +ve relationship


- If you take out EXPER – then you remove a NEGATIVE part of that +ve relationship
- Therefore, the education effect (without the less experience effect) gets stronger = GOES UP

- 🡨 gone from
0.54 to 0.64

Now, looking at gender…

🡨 more female = less education (not super significant
(0.05 p value) but still fairly significant = 0.5 of a year less education on average

^ makes the education co-efficient smaller! - 0.54 to 0.51


- So, what we include depends on what the objective of the study is – which effect we want to
include e.g., there may be a story behind females and wages e.g., females are forced to get married
and learn less

More than 2 Variables

🡨 can do more than 2 variables in R with a “+”

Perfect Multi-collinearity

^ once we have more variables – we don’t just worry about X and Y but also causal relationships with extra
control variables
- Perfect multi-collinearity = one control variable is an exact linear function of another (they are perfectly correlated)
- When one control variable goes up, the other goes up (or down) to the exact same degree
- Simplest way to generate this, is to create a new variable e.g., EDUC_in_days = EDUC*365
o Looks like very different numbers but they are proportional (see cor() = 1)

- 🡨 N/A for
educ_in_days because it contains the same information as EDUC (in years)
- Could be that a student may drop out halfway through a year and Educ in days would be different to
Educ in years – then wouldn’t have perfect correlation
- So R identifies that two variables in dataset have a perfect linear relationship so they will drop one
of the variables
- (the coefficient would be different to EDUC in years but what it represents would be the SAME
because now the unit doesn’t represent the change in one year, it is the change in one day – same
thing as saying salary in dollars vs pounds – number will change but not actually earning any more)
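A sketch of this perfect multi-collinearity example (wage1 data and variable names assumed):

wage1$educ_in_days <- wage1$educ * 365        # exact linear function of educ
cor(wage1$educ, wage1$educ_in_days)           # correlation is exactly 1
lm(wage ~ educ + educ_in_days, data = wage1)  # R reports NA for one of the two variables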

Accounting for variation: R2

🡨 R2 allows us to see the ratio between


the variation of Y and variation of Y Hat – the picture illustrates low R2 (the bit in red where there is a LOT
of variation in Y that is not explained by any variation in the model because the epsilons are very high)
- High R2 (blue) – the variation in Y is almost perfectly explained by the variation in the model (so X
and B1)
- Could have a very high R² with very biased coefficients – two things may be correlated but the model is not
really explaining anything – so be CAUTIOUS
- Internet def: R² is a statistical measure that represents the proportion of the variance of the
dependent variable that is explained by the independent variable(s) in a regression model 🡪 so while
correlation measures the strength of the relationship between the dependent and independent variables,
R² measures to what extent the variance of one variable explains the variance of the 2nd variable
o E.g., if the R² of a model is 0.50, then approximately half of the observed variation can be
explained by the model's inputs

R² = 100%
- R² can be made artificially high (e.g. by adding more and more variables)

*** vice-versa if you make up underlying variables, your R2 will go down!

Finding R2
• Accounting is not necessarily explaining
• R² is mechanically increasing as we add further variables
• If we have as many parameters as observations, R² is always 100% (e.g. consider 2 observations)
• Hence Adjusted R² = 1 − (1 − R²)(n − 1) / (n − (k + 1))
where k = number of variables
• i.e. the higher k, the lower the adjusted R²
• ALWAYS REPORTED IN REGRESSION


• So if you include (nearly) as many variables as you have observations 🡪 the DENOMINATOR n − (k + 1) becomes very
SMALL 🡪 the penalty term becomes BIG and the adjusted R² drops
• ^^^ here the gap between R² and adjusted R² is quite small because there are >500 observations and only 3 variables
• High R^2 🡪 (not as important for causality) useful for predictions

Imperfect Multi-Collinearity
🡪 Explanatory variables are closely but not perfectly correlated – a CLOSE relationship but not exactly PERFECT
🡪 Consequences:
• We can estimate all coefficients – so will still get output in R (instead of NA)
• Variance of estimates might be high i.e. estimates could be quite far off from true value.
• However: estimates will be unbiased (if x not correlated with ϵ)
🡪 Can use the VIF (variance inflation factor) command to see how much the variance is inflated
🡪 There may also be lots of insignificant effects (even though you know that JOINTLY the variables matter)
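A small sketch of checking variance inflation factors with the car package, applied to a model with several regressors such as the mod from the sketch above:

library(car)
vif(mod)   # a common rule of thumb flags values above about 10 as strong collinearity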

Joint Hypothesis Test – F-tests


Combining explanatory variables together to see if they are significant when grouped together –
linearHypothesis() command
F-tests: we are testing specific hypotheses on the different explanatory variables in the regression (= a
general way to test several variables and their impact on the model; e.g., here seeing if they are jointly
equal to zero or jointly significant)

🡨 can see he has grouped the age dummies
together (because separately the regression showed that they were insignificant) 🡪 but TOGETHER they are significant
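A sketch of such a joint F-test with car::linearHypothesis(); the age dummy names here are hypothetical:

library(car)
mod_age <- lm(wage ~ educ + age30to39 + age40to49, data = wage1)     # illustrative model
linearHypothesis(mod_age, c("age30to39 = 0", "age40to49 = 0"))       # joint test that both are zero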

(5) Bias Exercise


Dataset of white and black CVs:
1) Do gender and computer skills look balanced – i.e. random - across race groups? YES

2) Education & No. of jobs balanced across race? Do regression to be sure?? YES

🡨 p-values show no significance

3) What do you make of these results (mean and SD are also the same)?
- If there was any evidence of a systematic relationship between race and any of those characteristics
we could potentially be in trouble when simply comparing interview call backs for different race
groups. Any differences found could simply be due to those other factors rather than racial bias by
employers.
- In the data, there seems to be a clear difference between the races. Whereas for white people call-
back rates were above average (9.65%), they were below average for black people (6.45%),
suggesting a racial bias by employers.

4) Linear regression = predict the continuous dependent variable using a given set of independent variables
(mod <- lm(call~black,data))

Logit regression = predict the categorical dependent variable using a given set of independent variables
(mod2 <- glm(call~black,data,family=binomial))

Marginal effects of using logit = very similar to linear model.

5) Are black people significantly less likely to be employed? And what is the % difference?

🡨 YES and 9.1%

6) On the basis of your evidence what can you conclude about racial discrimination in the US labor market?
Do we see an open-and-shut case of racial bias here? Yes No

Think of the potential caveats and alternative explanations. What analysis could you undertake to address
some of these caveats?
- We see that people with black background tend to have less college education and college
education is another major driver of being employed
- Hence, far from implying a racial issue, the result in (b) could simply reflect employer’s preference
for more highly educated workers. We can examine this by doing the analysis in (b) separately for
workers with different educational attainment
- The results suggest that for either group there is a significant racial gap when it comes to being
employed. Note that the effect is considerably stronger for less educated workers.
o Hence this reinforces the hypothesis that there is discrimination against workers with black
background which is un-related to their productivity in the workplace.
- However, there might be further caveats: our simple regression cannot account for the quality of
the college education which can vary considerably and might vary systematically along racial lines.
- Furthermore, an important driver for a good education and for various other skills might have to do
with parental income and status.
o Again, this is likely to vary systematically along racial lines.
- While it is interesting to ask if employers discriminate above and beyond what could be expected on
the basis of education and skill of workers – which is what we were implicitly doing above - we
might also be concerned about the overall impact of racial background on labour market outcomes
which includes initially different educational outcomes.
- Hence, depending on our interest we might be primarily focused on the effect of race holding
education fixed or we might be focused on the overall effect.

(rest of exercise to complete)

Lecture 6: Econometrics for Dummies


Dealing with dummy variables = binary variable useful for dealing with qualitative data (this is important
e.g., gender, recession etc.)
🡪 It is one way of representing non-linear relationships in data
🡪 A trendline implies a (linear) relationship between variables, but dummy variables allow for other ways of
modelling relationships

Looking at Wage Regression Data:


Female coefficient estimate = -2.5
- So being a woman = 2.5 dollars per hour less wages than somebody who identifies as a man

🡨 Intercept is 7.1
^ Can see that when women = 0 (so a man) – it is 7.1 (intercept)
But when you add 1 (woman) 🡪 it becomes 4.59 (-2.5)
🡪 Conditional Expectation = another way of expressing it; so the average wage for women (conditional on
something else)
- So if someone is a woman, you would expect a wage of 4.59
- SO the average WAGE of woman = MAN’s WAGE + WOMAN coefficient
- So 7.1 + (-)2.51 = 4.59

Dummies as bars

^ another way of visualising


X and Y axis – non-linear step function (rather than a line)
So as you move along X variable – there is a step down

🡨 An example of perfect multi-collinearity (because male
and female are perfectly NEGATIVELY correlated; remember male = 0 and female = 1)
🡪 Have dropped one of the variables (female) because of the collinearity

🡪
🡪 For male – need to add the male coefficient to the constant – the female average + the male coefficient (the
STEP UP)

🡨 Adding a 0 means there is no constant


🡪 so male is 7.09 and female is 4.5 (same as before)
🡪 So again, can write the model differently
🡪 so sometimes the female is just a dummy variable – sometimes it isn’t

• Dummy variable = categorical variable (2 potential categories 🡪 3 ways of writing it down)


• Constant + dummy
• Constant + other dummy
• Both dummies + no constant
• Various ways to represent the same thing/model that men and women have a different average
wage by including combinations of dummy variables from the following
• “constant” : always equal to 1
• “male” : equal to 1 for men
• “female” : equal to 1 for women

• Which dummies we include exactly will affect the interpretation of the coefficients (β's)
• If we include “constant” and “male” (“female”) then “female” (“male”) becomes the reference
category
• The mean of the reference category is represented by the constant coefficient
• So far, a categorical variable with just 2 categories
• But what about more categories? E.g., countries/nationalities
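A sketch of the three equivalent specifications in R (wage1 data assumed; the male dummy is constructed for illustration):

wage1$male <- 1 - wage1$female              # complementary dummy
lm(wage ~ female, data = wage1)             # constant + female (male is the reference)
lm(wage ~ male, data = wage1)               # constant + male (female is the reference)
lm(wage ~ 0 + male + female, data = wage1)  # both dummies, no constant: the two group means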

Sets of dummies
More classification e.g., levels of schooling
🡪 Create new variable:

🡨 Create educats
🡪 educats = 1 when years == 12
🡪 educats = 2 when years >12
🡪 therefore educats =0 when <12 years

< so 116 with <12yrs; 198 with 12 yrs; 212 with >12yrs
🡪 Here there is a clear ordinal representation 🡪 more years = better
🡪 How to run a regression?
🡪 lm command
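A minimal sketch of creating the ordinal category and regressing on it (wage1 data assumed):

wage1$educats <- ifelse(wage1$educ < 12, 0,
                 ifelse(wage1$educ == 12, 1, 2))
table(wage1$educats)               # 116 / 198 / 212 in the lecture data
lm(wage ~ educats, data = wage1)   # one coefficient: the average "step" per category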

🡨 significant +ve coefficient (more education 🡪 more wages)


🡪 What does the 1.7 mean?
🡪 so education is split into 3 STEPS (below 12 years, 12 years – finishing secondary school – and above 12 years) 🡪 and each step up means 1.7
more dollars
🡪 but we aren’t able to give value to any of the steps e.g., is it more important that you finish secondary
school or a degree?

🡨 so create 3 dummy variables here
🡪 so variables that are either 0 or 1 e.g., if educats = that number, it is 1; if not, it is 0

🡨 one drops out and becomes the reference


category and the rest of them are negative; this shows that the people in the highest education category
get the highest wage and then you STEP DOWN to the others
🡪 so you step down by 2 for edunormal

🡨 here we have changed the reference


category

Key ADVANTAGE of using DUMMY VARIABLES 🡪 CAN see the DIFFERENT STEP UP vs. STEP DOWN
(because if it was linear – would be the same amount going up and down e.g., a certain number each time)
🡪 But with dummy variables – can see exactly how much you need to go up and down
🡪 The increment is not always the same

Testing the validity of a linear model


The estimated values never = the true values
Can do a hypothesis test to see if need a complex model or a linear model
(In this case a linear relationship looks like a good approximation)

🡨 reference here is edunormal
🡪 so testing that eduLOW is EQUAL to -VE eduHIGH
🡪 That you can go up and down the same amount
🡪 LOOKING AT COEFFICIENTS: if eduLOW was the reference point
- Would have to step UP by 1.3 to edunormal
- And step UP by 3.3 to eduHIGH
🡪 So estimating LINEAR:

🡨 High p-value, so we cannot reject. Hence, it


would be valid to use the linear model here

R – how to create more categories?

^ the second one (with 0) gets rid of the constant and instead of seeing the STEP UP, you just see the
average of each category.

From one regression to another with a control variable added:


If the coefficient on x goes DOWN after adding the control 🡪 the original estimate had an upward bias
If the coefficient on x goes UP 🡪 the original estimate had a downward bias

^ now have a dummy for normal; high and a dummy for female too!

^ so when do we just have the constant? = not female + not (normal/high) educated, i.e. a low-educated male; everything
else is compared to that reference group

So to figure out e.g., the wage for a female – it will be the constant + female
To figure out normally educated female – it will be constant + female + normal

Dummies as dependant variables


• So far we discussed dummies as explanatory variables
• However, we might also have dummies as dependent variables
• BLACK PEOPLE DATA – LESS EDUCATION? 🡪 LESS JOBS etc. 🡪 LOWER WAGES
• But always lots of confounders
• Even years of education (is that Harvard vs. community college)
• Data: CV call backs with black vs. non-black names
• We regress 𝐶𝐴𝐿𝐿 = β0 + β1𝐵𝐿𝐴𝐶𝐾 + ϵ
• Hence, following the discussion in this lecture:
• β0 = E{CALL | Non-Black} = ( ∑ i∈NonBlack CALL_i ) / n_NonBlack = P{Call | Non-Black}
• β1 = E{CALL | Black} − E{CALL | Non-Black} = P{Call | Black} − P{Call | Non-Black}
• β̂0 : share of non-Black applicants receiving a call back (the group average, i.e. the sum divided by the number of people in the
group; can be thought of as the probability of a non-Black applicant getting a call back)
• β̂1 : share of Black applicants receiving a call minus the share of non-Black applicants receiving a call – the STEP (the difference in
call-back probability between the Black and non-Black groups)
• i.e. there is a natural interpretation of coefficients when regressing dummies on dummies
• Things are a bit less clear when regressing dummies on – say – a linear term

🡨 CVs with “black”-sounding names have a 3.2 percentage point
lower chance of receiving a call back

Looking at Years of Experience

^ as years of experience go down – the wage goes down by 0.3 percentage points – but if you look at each year
individually, you can see different effects, so you could use dummy variables

Non-Linear Relationships
• Relationship between explanatory and dependent variables may be non-linear
• There are general methods to deal with this
• However, in many cases we can avoid using different methods because many types of seemingly
non-linear relationship can be represented in what boils down to a linear regression
• CAN EXPRESS AS A LINEAR MODEL AND USE LINEAR REGRESSION
• e.g. suppose you suspect that the relationship between wage and education in wage1.dta is actually
following a quadratic form:
𝑊𝑎𝑔𝑒 = β0 + β1·𝐸𝐷𝑈 + β2·𝐸𝐷𝑈² + ϵ

Square Relationship

^ having edu squared allows the relationship to be CURVED (a parabola)


^^ this suggests LATER years matter more – UPWARD SHAPE
^^^ so maybe LATER and EARLY years matter for wages
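A sketch of estimating the quadratic specification – it is still linear in the parameters, so lm() works (wage1 data assumed):

mod_sq <- lm(wage ~ educ + I(educ^2), data = wage1)   # I() protects the squared term in the formula
summary(mod_sq)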

Log-Linear Relationships
• The most popular non-linear model is probably the EXPONENTIAL model:
• 𝑌 = exp(β1 + β2·𝑋2 + … + β𝑘·𝑋𝑘 + ϵ)
• To make it linear all that is required is to take the (natural) logarithm on both sides of the equation:
• ln(𝑌) = β1 + β2·𝑋2 + … + β𝑘·𝑋𝑘 + ϵ
• (the linear expression of the above)
• One of the reasons why it’s popular is the interpretation of the β coefficients it implies

🡨 β ≈ the proportional change in Y for a one-unit change in X
^^^ so with the LOG on the left-hand side, when we change X by 1 unit, Y changes by (approximately) 100·β per cent – ln(Y) behaves like
a GROWTH RATE
*** this works because ln(1+z) is actually not that different from z itself when z is small 🡪 THIS IS
WHAT THE GRAPH IS SHOWING – the two lines are basically the same***

When is this model plausible?

What makes more sense?


- An example where absolute terms are not a good measure is when you want to see what happens to
wages with more experience (because people in different jobs, e.g. cleaner vs CEO, make
massively different absolute amounts, but you can compare percentage changes)

Log-log Model
Internet: This model is handy when the relationship is nonlinear in parameters, because the log
transformation generates the desired linearity in parameters (you may recall that linearity in parameters is
one of the OLS assumptions)

🡨 doing log of Y and log of X


The coefficient in this regression is then referred to as an ELASTICITY = a unit-free measure
This IS USEFUL in ECONOMICS, e.g. the price elasticity of demand (= we can calculate the effect of price changes on
quantity demanded)
- If demand is INELASTIC (between -1 and 0) – then might want to increase the price as quantity
doesn’t change too much

-

Cobb Douglas Example

^ a simple way to describe the relationship between the output/value added of a firm or an economy (Y) and its
production factors (labour/employment and capital); it shows you how much output changes when you
change the production factors, e.g. if you add more workers 🡪 can take the log of the function, which turns it into
a linear function, and then run a regression (output would go up by αL per cent for a 1% increase in labour)
Summary
• Don’t fall in the dummy variable trap
• The same model can be represented in several ways
• Be careful with interpretation of dummies
• A lot of stuff that looks non-linear at first glance is linear after all

Extra: Interactions

🡨 so this doesn’t allow us to see


anything actually complex, like comparing the discrimination between educational levels (the difference
will always just be BETA female), so instead…

Then for the normal education group, for example, the gender difference becomes: β_female + β_fem×norm:

CAN DO THIS ON R
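A sketch of such an interaction model in R (dummy names as used above, wage1 data assumed):

mod_int <- lm(wage ~ female + edunormal + eduhigh +
                     female:edunormal + female:eduhigh, data = wage1)
# equivalently: lm(wage ~ female * (edunormal + eduhigh), data = wage1)
summary(mod_int)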

🡨 so low is the reference; can see that normal
means higher salary on average and high means even higher salary on average
^ can also see that for female gender – there is an effect and it is even bigger for the normal group
- E.g., in the normal group, it is almost 1.7 lower (-1.3 + -1.3)
- But there is no significance, so we could interpret it as saying: let's go back to our original model because the
difference in the gender gap across education groups is not significant

Interactions for linear models/continuous variables

🡨 can see that we can plot models for each group


^ for the high education group, the experience slope would be the reference (low) slope of 0.028 + the high interaction of 0.077 =
nearly 0.10 – so every year of experience gives about 10 cents more, whereas for the low group it's only 2-3 cents more
(0.028)
- And for the medium education, it’s pretty much the same (the slope looks similar too) and it’s not
significantly different either

^ so the model expresses that experience enters in a linear way (hence why you multiply by exper)

(exercise 6 to complete)

Dummies can be used for lots of things…

Lecture 7: Instrumental Variables

^ in any (linear) model 🡪 we want to estimate BETA – how X affects Y – but there are other factors that are
independent of X that are also affecting Y
🡪 Part of the reason is that there are probably MULTIPLE factors driving X, e.g. education is driven by
studying hard, living in a richer country etc.
Instrumental variables 🡪 if you can identify at least one of these factors driving X that is independent of Epsilon –
you can then use it to potentially find an UNBIASED or CONSISTENT estimate of BETA

2 Stage Least Squares Estimator (2 SLS)

^ so if you find such a variable, Z (Pi here is just the Greek letter used for the first-stage coefficients)
- Figure out the relationship between X and Z
- Find the estimated relationship
- Then find the relationship between Y and Xhat
- The only thing that can move the Xhat up or down is the Z (if Z goes UP or DOWN)
- So Xhat will NOT be correlated with EPSILON (because Z is not correlated with EPSILON)

Schooling Effects Example

🡪 Academic Talent: we would expect an overestimate (upward bias) from leaving this out (because it has a +ve effect on
both education and the epsilon)
🡪 Super Nerd: Studying really hard may also have an effect – would have a negative effect on epsilon
because you don’t necessarily make loads of money
🡪 Another factor that affects is cost of attending – could be affected by closeness to college e.g.,
international vs. local student (has nothing to do with talent or nerd or any other factors affecting wage)
- This factor is attractive because also can find data on it
- So use the DISTANCE instrument

1ST STAGE

^ The F statistic of instruments in 1st stage should be LARGER THAN 10 🡪 here it is 88
2ND STAGE
🡪 Take the estimates from the first stage – PREDICT X VALUES 🡪 THEN USE X VALUES IN NORMAL
REGRESSION
🡪 ivreg() command

🡪
🡪 If there is not much change, then you know that it was a good story but not a good instrument
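A minimal 2SLS sketch with ivreg() from the AER package; the data frame and variable names (schooling, nearcollege) are assumptions for illustration:

library(AER)   # provides ivreg()

# outcome ~ endogenous regressor | instrument(s)
mod_iv <- ivreg(log(wage) ~ educ | nearcollege, data = schooling)
summary(mod_iv, diagnostics = TRUE)   # diagnostics include the weak-instrument (first-stage F) test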

A graphical representation of IV

3 Key Criteria

1. The problem could be that X is causing Epsilon – that they are RELATED
- So need to make sure instrument is INDEPENDENT (Epsilon can be driving X but should NOT be
driving Z)
- We can argue for this criterion
2. It is easy to find things that have nothing to do with EPSILON, but Z must be driving X and must be driving
X quite STRONGLY
- Can outright check with the iv command (look separately at the FIRST STAGE to see if Z is super
significant as a DRIVER OF X)
3. The epsilon mustn’t be driving the Z but also the Z mustn’t be driving the EPSILON
- The only way that Z should affect Y is through X

^ How can these criteria be violated?
- Lots of colleges in London 🡪 London obviously higher salaries
- Big city colleges 🡪 may draw more high flyers
- May have to include control variables

Control Variables

^ Then instead of the instrument having to be uncorrelated with EVERYTHING in the error, it only has to be uncorrelated with the error
that remains once we condition on the CONTROL variables

^ added regional control variables
^^ changed from 114 to 98 so tells us that there is something in this story
🡪 Now let's see if the relevance criterion (2) still holds – that the instrument is still really SIGNIFICANT even after
including all the control variables

🡨 nearby = 0.6 more years of education (SIGNIFICANT)


^ but not good enough to just look at normal significance levels – need to look AT F STATISTIC
🡪 LINEAR HYPOTHESIS COMMAND

🡪 >10

NOTE: reduced form = regressing the outcome on instrument (seeing how they are actually related e.g.,
closeness effect won’t affect people that didn’t actually go to college)

Weak instrument problem

🡪 Need a strong relationship between Z and X in the denominator (so that it is a large number)
🡪 Because a small denominator inflates whatever is in the numerator – the estimate will be BIASED
🡪 Can avoid if you have a strong first stage – a STRONG INSTRUMENT
- We want first stage F statistic to be LARGE

Family Size Example

However, there is randomness in family size as well


Two factors that are out of the control of even the most controlling parents:
• Occurrence of twins
• Sex of baby
Ok, those might meet criterion 1 from earlier, but how about criterion 2?
• Twins: families might only have planned for 2 kids, but when they had twins they un-intentionally
had 3
• Many families have preference for a sex mix (a boy and a girl)
• Hence, if they have two kids of the same sex they are more likely to carry on having more kids
2nd stage – 2SLS estimates show NO RELATIONSHIP
- Family size seemed to have +ve relationship but NO SIGNIFICANCE

- So family size doesn’t seem to affect (is actually a good finding)

Multiple Instruments

^ e.g., Closeness of college + which college + how much college + type of education etc.
🡪 So, need good causal estimates of all of them
🡪 So, if there are two endogenous X variables, then we need at least 2 INSTRUMENTS
- Otherwise you wouldn't know which X variable is being affected by the instrument

Summary
• Endogeneity is often a problem: X is correlated with ϵ
• However, X is also driven by other factors
• If we can find data on at least one other factor Z which is independent of ϵ we can do 2SLS IV
• Can combine with using various other controls to make it more plausible that remaining error ϵ is
indeed independent of Z
• Need to ensure strong first stage
• Finding IVs is a bit of an art

(7) Instruments Exercise


1. Regression of log quantity on log price – what does it show?
- The regression implies a price elasticity of -0.64; i.e. a 1% increase in price will lead to a 0.64%
reduction in demand.
- We can interpret this as a demand curve, if the price coefficient really represents the effect of a
price change on demand holding all other things constant.
- One reason why this might not be the case is if shocks to demand cause price changes.
- On the other hand if we can control for big potential shocks to demand it will be possible to recover
un-biased estimates.
- A big potential demand shock here is the freezing of the lakes: Because rail transport becomes the
only option and hence demand increases it might lead to price changes. Indeed we see that “ice”
has a significant positive effect on demand.

(exercise 7 to complete)

Lecture 8: Time Series

Time Series data: Different data points represent different points in time 🡪 This introduces some additional
challenges

What’s the challenge of time series data?


ROGUE examples of correlations – the problem is that TIME becomes a CONFOUNDING factor e.g., things
are just growing so if you plot them together, it will seem like they are growing together
- E.g., Economic recession, pandemic – lot of significant one-time things that will create trends or
effects
- May think there is a causal relationship rather than due to an incident in time
- Non-stationary: characteristics of data vary with time

COVID vs. GDP Example

^ red line = covid and there is just a sharp increase in 2019 – let's test a regression
🡪 the estimate is 5% and is significant!
🡪 NOTE: we take the log of the economic activity index – with the log, the coefficient gives the change in Y in
percentage terms (usually, if you increase X there is a unit change in Y, but this way it is a % change in Y)

Take control of time… Include time as a CONFOUNDING FACTOR:

🡨 shows that US economy grows by 0.000375% every year


^ 100k cases is the units of the independent variable
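A sketch of including a linear time trend as a control; the data frame and variable names are made up for illustration:

mod_trend <- lm(log(econ_activity) ~ cases_100k + week_number, data = us_weekly)
summary(mod_trend)   # the week_number coefficient picks up the continuous growth trend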

What if time is not linear?

So including a time trend shows whether something grows/shrinks continuously but what about a one-off
event that moves variables in a direction e.g., recession, pandemic
Panel Data = time series data AND cross-sectional data TOGETHER e.g., time data for multiple countries
^ data above shows weekly data for US states so several observations
🡪 Can introduce a dummy that captures the week

Covid Hoaxism Example


Basic regression of covid hoax tweets against COVID cases shows: Hoax share up by 1 percentage point
means 11555 more cases
🡪 When you then control for time - Smaller effect when controlling for time (week) effects (7665)
🡪 So it did CHANGE something
🡪 +ve of PANEL DATA: can deal with growth trends as well as idiosyncratic trends (big things in just one
time period)
- Some states may be more prone to covid – education varies; some states are very rural vs. NYC etc.
and this may show in their tweets
- Can see that when you control for state as well 🡪 is even smaller (3788)
🡪 So PANEL DATA is very USEFUL – can see time-fixed effects and other fixed effects e.g., firm-fixed effects
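A sketch of adding week and state fixed effects with factor() dummies (variable names assumed):

mod_fe <- lm(cases ~ hoax_share + factor(week) + factor(state), data = us_weekly)
summary(mod_fe)   # hoax_share is now identified from within-week, within-state variation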
🡪 BUT not the only problem with time series data…

Autoregression
• Not only could time be a confounding factor but also an assumption that we are making implicitly is
that shocks are not related from one observation to the next
• A particular concern in time series is the possibility that observations are correlated over time
• Simplest way to model this is via an Auto regression:
• 𝑌𝑡 = β0 + β1𝑌𝑡−1 + ϵ𝑡
• 𝑌𝑡−1 becomes the X variable (this is simply the dependent variable in the period before, e.g.,
the week before)
• “the value of today depends on the value of yesterday and some randomness we can’t
predict”
• So include the past as a specific variable
• We can do normal OLS as long as − 1 < β1 < 1
• With β = 1 we have non-stationarity because of path dependence
• The series can wander off into any direction and never come back
• If that happens OLS is no longer un-biased (different observations are too related to each other)
• Also: if you are interested in 𝑌 = β𝑋 and both Y and X have unit roots you will have a spurious
correlation (the unit root becomes the confounder)
• Random Walk
• Of course we don’t know if this is the case in our data before we start any analysis
• INTERNET: WHAT IS A UNIT ROOT?
• = stochastic (random probability distribution) trend in time series
• Unit roots (when = 1) give us insights into whether time series will recover to its expected
value & if not, then susceptible to shocks and hard to predict and control

Internet: Stationarity
🡪 = an important characteristic of time series
- Time series said to be stationary if statistical properties do not change over time
o So constant mean and variance
- There is a statistical test that we can run to determine if series is stationary or not
o = DICKEY FULLER TEST
o Tests the null hypothesis that a unit root is present
o If the null cannot be rejected, a unit root is (likely) present and the process is not stationary
o If the null is rejected, the process is stationary

Dickey-Fuller Test

^ So this transformation is because we are worried the BETA might be equal to 1


Remember DELTA = B - 1 (we subtract Y(t-1) from both sides)
If DELTA = 0, then BETA = 1 🡪 there IS a UNIT ROOT 🡪 NOT stationary
If DELTA is significantly smaller than zero, then BETA < 1 🡪 NO unit root 🡪 NORMAL and STATIONARY 🡪 that is what we want

Using R
library(urca) 🡪 ur.df() command (runs an (augmented) Dickey-Fuller test on a series)
⇨ ur = unit root, df = Dickey-Fuller


⇨ Using Covid Hoax Data:

⇨ ^ looks a bit like a regression

⇨ IMPORTANT: lag coefficient = DELTA 🡪 Is that number smaller than zero?

⇨ NO so therefore there probably IS A UNIT ROOT

⇨ CAN ALSO LOOK AT VALUE OF TEST-STATISTIC

⇨ Dickey fuller test provides us with new critical values to compare test statistic to

⇨ ^ can see there is 1%, 5%, 10%

⇨ WE WANT TEST STATISTIC TO BE SMALLER THAN CRITICAL VALUE
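A minimal sketch of running the test with urca (the series name is an assumption):

library(urca)

adf <- ur.df(us_weekly$hoax_share, type = "drift", lags = 1)   # augmented Dickey-Fuller with an intercept
summary(adf)   # compare the test statistic with the 1% / 5% / 10% critical values it reports
# if we cannot reject, difference the series and test again:
adf_diff <- ur.df(diff(us_weekly$hoax_share), type = "drift", lags = 1)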

So, time series are used a lot in central banks, e.g. to predict growth in the next week, quarter, year etc. BUT we
need to check for autoregression, i.e. whether the series depends too much on its own history

^ e.g., blue line may look like an upward trend but it’s actually a unit root

Getting Rid of UNIT ROOTS

We need to difference the series to get rid of the unit root – Another EXAMPLE:

🡨 After differencing

So with time series – need to worry about spurious effects due to time trends or unit roots – so before you
run a regression of a series – need to make sure the series is stationary.

Summary
⇨ Time series can be easy

⇨ But need to worry about how stationary your series is

⇨ If the series clearly grows or shrinks continuously definitely include a time trend

⇨ However, even if it doesn’t grow (or shrink) the series might contain a unit root

⇨ If that’s the case a time trend isn’t enough

⇨ Use the Dickey Fuller Test to make sure you are dealing with a stationary series

***Causality and Unit Roots***


• If X causes Y then both need to be integrated of the same order
• i.e. if X has a unit root Y has a unit root as well
• If Y has a unit root but not X then X can (potentially) have a causal effect on ∆𝑌
• If X has a unit root but not Y we should be looking for a causal effect of Δ𝑋 on 𝑌

Lecture 9: Learning like a Machine


We will here focus on predictive analytics: training a model on labeled data (“where we know the right
answer, e.g. dog or cat”) to then guess the answer in another similar* dataset

Examples:
- Predicting whether a picture is a picture of a cat or a dog to prevent spam on social network for cat
owners
- Predicting if a mushroom is toxic or not based on a picture
- Predicting if a person is a republican or a democrat based on demographics
or...
* Similar is a big necessary assumption here: the model learns from the data, so if the data is not
representative the model will not work well, or may even be completely wrong

1. Do a logistic regression (interpret in the same way as a linear regression)


2. Which you can then use to make predictions and compute the accuracy of your model

Survived - SEX

Survived – Age + Class + Sex
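A sketch of the logistic regression and accuracy computation on Titanic-style data (the data frame and variable names are assumptions, and missing values are ignored for simplicity):

mod_logit <- glm(Survived ~ Age + Pclass + Sex, data = titanic, family = binomial)

pred_prob  <- predict(mod_logit, type = "response")    # predicted survival probability
pred_class <- ifelse(pred_prob > 0.5, 1, 0)            # classify at the 0.5 threshold
mean(pred_class == titanic$Survived)                   # share of correct predictions = accuracy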

Beyond accuracy
Evaluating how often and in what way my model is wrong

Confusion Matrix

In R:
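A minimal confusion matrix sketch, continuing from the hypothetical predictions above:

table(predicted = pred_class, actual = titanic$Survived)   # counts of true/false positives and negatives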

Learning irrelevant details of the dataset: Overfitting


We say that a model is overfitting if its predictive abilities (output) depend too much on the data which was
used for learning the parameters.

- The underfit model doesn't match the data, but the overfit model is WAY too complicated
- Think about adding one new point to the graph, how well will the model perform?
- The more complex the model is, the more likely it is to overfit. Overfitting is a huge issue as the
model will seem to perform great when training it and then perform poorly when applied.

🡨 so the MORE variables you
add, the more you IMPROVE the fit of the model to the (training) dataset! Same with R^2

🡨 here each extra


variable added would actually make out-of-sample performance WORSE – so we need to find the sweet spot where the error is minimised

Preventing overfit: Train – Validation


To prevent overfitting and make sure we have a good estimate of how our classifier will perform, we divide
our dataset (at random) into two sets:
Training set: Part of the dataset that we use to train the model. For example to learn the coefficients of our
logistic regression.
Validation set: Part of the dataset that we use to decide on the type of model and depth of the tree

How to do this?

^ split the data – do it randomly
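A sketch of a random 80/20 train/validation split (the data frame name is an assumption):

set.seed(42)
train_idx <- sample(seq_len(nrow(titanic)), size = floor(0.8 * nrow(titanic)))
train_set <- titanic[train_idx, ]    # used to estimate the model
valid_set <- titanic[-train_idx, ]   # used to evaluate / choose between models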

Decision Tree

🡨 Example of a model that a machine would use when making predictions

🡨 Example with gender and the class they


were in
🡪 There is something called a GINI index = a measure of how mixed (impure) the classes are within a node; it is used
to find the most influential variables
🡪 e.g., if there was no impurity and everyone in a node survived – the GINI would be 0
🡪 So, initially the GINI is 0.48 and GENDER brings the GINI down the most
🡪 Weighted average of the two sex GINI’s is 0.3
- Vs. others is like 0.4 etc.
🡪 Then need to see what brings GINI down next

🡨 you can stop splitting when you can no longer find an
improvement in GINI of more than 0.01
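A sketch with rpart; cp = 0.01 is the complexity parameter that stops splitting once an additional split no longer improves the fit by more than 1% (data and variable names assumed):

library(rpart)

tree <- rpart(Survived ~ Sex + Age + Pclass, data = train_set,
              method = "class", control = rpart.control(cp = 0.01))
printcp(tree)   # shows how the error falls with each split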

Lecture 10: Loose Ends

