SRM Notes
Table of Contents
Econometrics with R................................................................................................................................. 3
Introduction........................................................................................................................................................ 3
Probability Theory...............................................................................................................................................3
A Review of Statistics using R.............................................................................................................................. 6
Regression...........................................................................................................................................................8
Lecture 1: Telling stories with and about Data.......................................................................................... 8
Objectives........................................................................................................................................................... 9
Data Stories.........................................................................................................................................................9
Sex + iPhone Example................................................................................................................................................................ 9
Randomised Control Trials (RCT)..............................................................................................................................................10
Xenophobia in UK Example...................................................................................................................................................... 11
Adding a trendline = Modelling the data............................................................................................................11
True Model driving the data vs. Estimate........................................................................................................... 12
(1) Data Stories Exercise..........................................................................................................................15
Lecture 2: R.............................................................................................................................................17
Zoom 1: R Continued...............................................................................................................................20
Regression.........................................................................................................................................................20
OLS Algorithm................................................................................................................................................... 22
Key R Commands for Data................................................................................................................................. 24
(2) R Exercise.......................................................................................................................................... 26
Lecture 3: Visions....................................................................................................................................27
Example: Soho Cholera Outbreak............................................................................................................................................ 28
What can go wrong?..........................................................................................................................................28
Example: Covid Hoaxism....................................................................................................................................28
Scatter Plots............................................................................................................................................................................. 29
Time Series...............................................................................................................................................................................30
Bar Charts................................................................................................................................................................................ 33
Histograms............................................................................................................................................................................... 34
Density..................................................................................................................................................................................... 34
Maps........................................................................................................................................................................................ 35
D3.............................................................................................................................................................................................35
Multiple Instruments.........................................................................................................................................78
(7) Instruments Exercise.......................................................................................................................... 78
Lecture 8: Time Series............................................................................................................................. 78
What’s the challenge of time series data?..........................................................................................................79
COVID vs. GDP Example..................................................................................................................................... 79
What if time is not linear?................................................................................................................................. 80
Covid Hoaxism Example.....................................................................................................................................80
Autoregression.................................................................................................................................................. 80
Internet: Stationarity......................................................................................................................................... 81
Dickey-Fuller Test...............................................................................................................................................81
Using R.............................................................................................................................................................. 81
Getting Rid of UNIT ROOTS................................................................................................................................ 83
Lecture 9: Learning like a Machine..........................................................................................................84
Beyond accuracy................................................................................................................................................85
Evaluating how often and how is my model wrong................................................................................................................. 85
Confusion Matrix...............................................................................................................................................85
In R:.......................................................................................................................................................................................... 86
Learning irrelevant details of the dataset: Overfitting........................................................................................ 86
Preventing overfit: Train – Validation.......................................................................................................................................87
Decision Tree..................................................................................................................................................... 88
Lecture 10: Loose Ends............................................................................................................................ 90
Econometrics with R
Introduction
- Why R?
o Reproducible
o Many add-ons available on CRAN (comprehensive R archive network)
o Easy to update and expand
- R
o > is prompt
▪ Code entered will be executed
o Objects are defined with <-
▪ E.g., x <- 10 means x = 10
▪ x is a vector
▪ Vectors can hold multiple numbers e.g., 1–5, or text e.g., “Hello”
o Functions
▪ Function name is always followed by ()
▪ E.g., seq() = sequence function
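A minimal sketch of these basics in an R session (the object names are just illustrations):
x <- 10                 # assign the value 10 to the object x
v <- c(1, 2, 3, 4, 5)   # a vector holding several numbers
greeting <- "Hello"     # a vector can also hold text
seq(1, 10, by = 2)      # the seq() function returns 1 3 5 7 9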
Probability Theory
Basic concepts
- Mutually exclusive results of random process = outcomes
o Mutually exclusive = means only 1 of possible outcomes can happen
- Probability = proportion of outcome occurring in long run if experiment is repeated many times
- Set of all possible outcomes of random variable = sample space
- Event = subset of sample space; 1 or more outcomes
- Random variable = numerical summary of random outcomes
o Can be discrete or continuous
▪ Discrete = numbers e.g., 0 and 1
▪ Continuous = continuum of possible values
Dice = random
- Binomial probability: P(k) = (n choose k) · p^k · (1 − p)^(n−k)
- Where n = number of coin tosses e.g., 10 and p = probability of heads, so 0.5
- So k = 5 Heads
- Write this as P(k=5)
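The binomial probability itself is easy to get in R; a sketch for the coin-toss numbers used above:
dbinom(5, size = 10, prob = 0.5)   # P(k = 5) for 10 fair coin tosses, about 0.246
sum(dbinom(0:10, 10, 0.5))         # the probabilities of all possible outcomes sum to 1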
Expected value of a random variable = average value of outcomes when running for a repeated number of
trials
- For discrete variable = weighted average number of possible outcomes
o Weights are related probabilities
Random sampling in R
sample(1:6, 3, replace = TRUE)
^ Dice = 1:6; 3 samples; replace = TRUE means putting the number back after each draw
R is actually a pseudo-random number generator (not truly random but close enough)
Standard Deviation of discrete random variable Y = deviation of random variable from its mean
- The SqRoot of Variance
Variance of discrete random variable Y = squared deviation of a random variable from its mean
Sample variance = how observations are dispersed around the sample average
Population variance = dispersion of whole population around average
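As a sketch of the weighted-average idea behind these definitions (using a fair die as the random variable):
y  <- 1:6                  # possible outcomes of a fair die
p  <- rep(1/6, 6)          # their probabilities (the weights)
mu <- sum(y * p)           # expected value = 3.5
v  <- sum(p * (y - mu)^2)  # variance of the random variable, about 2.92
s  <- sqrt(v)              # standard deviation = square root of the variance, about 1.71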
- Can obtain the normal density at different positions using:
o dnorm(x = c(…, …))
The F distribution
- Degrees of freedom
- Related to other distributions
Random Sampling = Objects drawn at random from a population – each object is equally likely to end up in
the sample
- Any function of two random variables = also random
Average of a random sample 🡪 is a random variable itself
- This random variable has probability distribution = sampling distribution
To examine the distribution of univariate numerical data: plot it as a histogram and compare it to some
known or assumed distribution
- Will give frequency histogram
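A sketch of this in R: simulate many sample averages of dice rolls and plot their frequency histogram.
set.seed(1)
means <- replicate(1000, mean(sample(1:6, 10, replace = TRUE)))  # 1000 sample averages
hist(means, xlab = "Sample mean", main = "Sampling distribution of the average")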
Unbiasedness = the mean of the sampling distribution of an estimator μ̂Y for the population
mean μY equals μY
Consistency = the uncertainty of the estimator μ̂Y decreases as the number of observations in the sample
grows 🡪 the probability that it is close to μY tends to 1
p-value = the probability, assuming the null hypothesis is true, of drawing data and observing a test statistic
that is at least as adverse to the null as the test statistic actually computed using the sample data
The significance level of the test is the probability to commit a type-I-error we are willing to accept in
advance.
- E.g., using a prespecified significance level of 0.05, we reject the null hypothesis if and only if
the p-value is less than 0.05. The significance level is chosen before the test is conducted.
An equivalent procedure is to reject the null hypothesis if the observed test statistic is, in absolute value
terms, larger than the critical value of the test statistic.
- The critical value is determined by the significance level chosen and defines two disjoint sets of
values which are called acceptance region and rejection region.
- The acceptance region contains all values of the test statistic for which the test does not reject
- While the rejection region contains all the values for which the test does reject
The p-value is the probability that, in repeated sampling under the same conditions a test statistic is
observed that provides just as much evidence against the null hypothesis as the test statistic actually
observed.
The actual probability that the test rejects the true null hypothesis is called the size of the test. In an ideal
setting, the size equals the significance level.
The probability that the test correctly rejects a false null hypothesis is called power.
A 95% confidence interval for μY is a random variable that contains the true μY in 95% of all possible
random samples.
Like variance, covariance and correlation of 2 variables = properties that relate to the (unknown) joint
probability distribution of these variables
- Estimate covariance & correlation w/suitable estimators using a sample
- Sample covariance = estimator for the population covariance of X and Y
- Sample correlation = can be used to estimate population correlation
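In R the sample versions are one-liners (df, X and Y are placeholder names):
cov(df$X, df$Y)   # sample covariance: estimator for the population covariance
cor(df$X, df$Y)   # sample correlation: estimator for the population correlation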
Regression
To build a simple linear regression model, we hypothesize that the relationship between the dependent and
independent variable is linear, formally: Yi = β0 + β1·Xi + εi
Lecture 1: Telling stories with and about Data
Objectives
- Learn basic data analysis tools for economics & business
- Learn commands on basic statistical software
- Use data analysis as a decision-making tool
Data Stories
Base decisions on evidence – can be anecdotes/case studies
- These days we have vast amounts of data – BIG DATA
o E.g., consumer activity
o Country data
- Need powerful tools to make sense of this
- Economics + statistics = econometrics
Need to tell a story with data – what is the data telling you? Imagination
-
- Arrows represent causation
- Think about mechanism – why?
o Is it wealth, taste, style, more iPhone dating apps?
o People with iPhones are people who brag more? Will they exaggerate their sex as well as
their phone?
o iPhone users meet more people? Trendy?
o Android users too busy?
o Having more money? – Drives both sex and having an iPhone
o
o If there is no TRUE +ve relationship between iPhone and sex; it is SPURIOUS relationship
o Or if there is truly a +ve relationship due to another factor = CONFOUNDING factor
o Or the relationship found could be bigger than the true causal relationship (e.g., there is a small
relationship between iPhones and sex but the money effect makes it seem stronger, because
money is what actually affects it) = UPWARD BIAS
o Upward bias: our estimated effect is larger than the true one
o There could be other things going on e.g., work hard 🡪 leads to more money which leads to
iPhones BUT hard work also leads to less time sooooo less sex???
▪ is it negatively correlated?! But not what the data found – but there’s lots of things
going on in data
▪ or if it’s the case that iPhone users actually have even more sex
▪ = Downward bias: our estimated effect is smaller than the true one
Endogeneity Problem: the above problem is driven by factors that are endogenous to the variables
(money, work etc.)
🡪 Need exogenous variation instead
🡪 Where does that come from?
o
o ^ examples of RCTs in economics fields
o There are lots of crazy relationships (but can be explained by story)
- In order to make true relationships (causal) 🡪 need to do actual quantitative data analysis
- MODELLING
- Use mathematical formulas
Xenophobia in UK Example
Since 2010 – UK vilifying foreign citizens
Claim: Foreigners responsible for crime
🡪 Hard to do an RCT for this kind of example because only ONE UK – cannot have a “control”
May look at correlation between % of foreigners and % of crime rates in different UK authorities
- Correlation of 0.4333 🡪 not that high (HIGH +ve correlation = 1; HIGH -ve = -1)
- +ve relationship
- Confounding factors: income, population size etc.
B0 = when there is 0% (e.g., crime rate with 0 foreigners)
B1 = slope; the causal relationship between the two variables (e.g. when foreigners go up 1%, how much
does crime go up?)
- Every observation, i
- To describe every individual data point, additional variable E for each i
- E measures gap between observation and trendline
o Residual or Error Term
o = everything the first term of the model cannot account for
o
o So the i describes the dataset – an index for the row of data in your table you are dealing
with
o The epsilon E describes the gap between the trendline and the datapoint (residual/error)
o Residual/Error = gap between what the model tells you and what the actual data is
True Model driving the data vs. Estimate
This can be used to uncover the true relationship that is out there
e.g., the true relationship between foreigners and crime
So, the blue line is our guess (our estimate of the true model)
The Green is the true model that is out there.
Why is there a difference between the true model and estimated model????
- So the estimated model is just that – an estimate
o It is estimated based on a sample
o A sample of finite data
▪ If we think about alllll the possible samples that are out there – its infinite (so cannot
get the true value)
o So it is not SYSTEMATICALLY WRONG
o It is not right (so by definition it is wrong), but not systematically wrong because if we took
another sample of similar data, we would not make the exact same mistake
▪ Would get something that is slightly different
o E.g. random sample average of dice rolling could be 4.8 but true average is 3.5
o May get a slope that is too high or too low – unlikely to get the exact same (green) slope
again
- There could be confounding factors that make us systematically over- or under- estimate the actual
parameters
o Other things that are driving the presence of foreigners in the area or crimes in the area
o Or other things affecting dice throws
o E.g. what other factors could be driving the data (apart from foreigners being criminals)?
If we’re considering confounding factors when thinking about the true (green) model – let’s firstly consider
population
Where the population is higher 🡪 higher numbers of foreigners 🡪 so E will be bigger
Whereas smaller city 🡪 less population 🡪 less foreigners 🡪 less crime 🡪 E will be lower
^ the Red line above (from dataset) has upward biased relationship (compared with true causal model)
Confounding factors: let’s secondly consider unemployment
^ so the red line above has downward bias, the slope is too flat, we have underestimated the relationship
So depending on which confounding factor we are considering, we could be over or under estimating the
relationship.
(1) Data Stories Exercise
1.1 “Study Finds Negative Effects of Police-Worn Body Cameras” - Police body worn cameras increase
violence against the police?
RALF: Would typically expect the opposite, i.e., if people are on camera it is more likely that they can be
prosecuted and punished for attacking a police officer 🡪 But story?
- Maybe there are people who like a reputation as misfits - badge of honour in their peer group?
- Now what could be better for those people than being known and seen on camera for attacking a
police officer?
- Or? Perhaps having a camera changes the behaviour of the officers. They now feel more confident -
indeed perhaps overconfident - to engage with people that are more prone to attack officers.
1.2 “Adding guacamole could boost online daters' popularity” 🡪 Guacamole makes you more successful at
online dating?
Maybe:
- Causal? Avocados are more expensive 🡪 more money 🡪 makes you more popular in online dating
- Confounder? Guacamole is quite trendy 🡪 could be an area of common interest for more people 🡪
more popular at dating
RALF: The headline suggests that there is a positive causal effect from mentioning Guacamole to receiving
responses in online dating. A driver for this could be that Guacamole is a signal for a healthy lifestyle and
health and fitness are qualities that are desired when looking for a romantic partner (The story of the
article) OR:
- Perhaps younger people are more likely to be into guacamole. But younger people can also be
expected to get more dating responses. This would imply an upward bias of the Guacamole->Dating
success effect.
- A taste for guacamole could be more prevalent among people living in cities. People are more likely
to respond to people that are close by. Hence, people in cities will have more people close by and
therefore receive more responses. Again this implies an upward bias.
1.3 🡪 Going to museums makes you live longer?
People who go to museums like to take care of their mental and physical wellbeing; mentally by stimulating
themselves at museums and physically also 🡪 therefore they live longer (fewer diseases)
People who have more money and more leisure time are more likely to go to museums; these people are
also more likely to spend more money on gym memberships because they have the time and money to do
so, therefore they live longer and healthier 🡪 upward bias.
RALF: It’s conceivable that visiting a museum calms you down, allows time for reflection and leads to insights
that help you to be more healthy and less stressed and as a consequence makes you live longer.
Equally, museum visits could simply be conflated with other more clearcut factors that make you live
longer. Income and education would be clear candidates.
That said, the study quoted in the article has already accounted for these. But there are other factors that
are not easily controlled for. For instance, having a stressful job with little spare time is probably not good
for finding time to go to the museum or for your life expectancy. This mechanism would bias the
museum -> life expectancy effect upward (stressful job is negatively correlated with both museum visits
and life expectancy)
1.4 “Secret to Winning a Nobel Prize? Eat More Chocolate” 🡪 Eating chocolate helps you win a Nobel?
RALF: A New York doctor wanted to investigate the effect of flavanols, compounds found in chocolate but
also tea and red wine, on cognitive ability. He used the number of Nobels as a convenient proxy for his
outcome variable and country-level chocolate consumption data as his explanatory variable and got a
highly significant positive relationship. What else could be going on, though?
- It might well be that we have an omitted variable here - wealth - which drives that positive
relationship.
- A richer country (like Switzerland, which has 26 Nobel winners) will have more resources to invest in
research and its affluent citizens might be more likely to be able to treat themselves frequently to
chocolate.
- It might also be that people who study (and are therefore more likely to get a Nobel, of course),
need a sugar fix more often and snack more.
- In both of these cases we’d have an upwards bias as wealth and time spent studying are positively
correlated with both chocolate consumption and the number of Nobel prizes.
1.5 “Sex Makes You Rich? Why We Keep Saying “Correlation Is Not Causation” Even Though It’s
Annoying” 🡪 Sex makes you rich?
RALF: It might be the case that having sex triggers a rush of hormones, which then make you more
productive and give you an edge in the workplace.
- But it is perhaps more plausible (and this is in fact what the original study claims) that sex is another
indicator of health, which is our omitted variable. In this case, we would have an upwards bias.
1.6 “Does marriage make people happy, or do happy people get married?” Marriage makes you happy?
Upward bias: People who had successful parents who had good marriages are more likely to get married
and are also less likely to have issues in the future so more likely to be happy – upward bias because
successful parental relationships are positively correlated with both marriage and happiness
RALF: Reverse causality is a tricky nut to crack. This study finds evidence that happier singles opt more
frequently for marriage and that benefits of marriage vary widely among couples – not surprisingly an
equal division of labour at home is an important factor driving happiness in marriage.
1.7 “Do Night Lights Cause Myopia?” 🡪 Sleeping with a night light as a kid makes you blind later?
Kids who have night lights are more likely to be anxious and shy 🡪 anxious and shy kids tend to spend a lot
of time on the internet and computer rather than talking to actual humans 🡪 lots of screen time ruins their
eyesight 🡪 leads to blindness
RALF: Short-sightedness is a growing problem globally - culprit: night-lights. The data, at face value, seemed
to have suggested that kids who sleep with night lights develop myopia later in life. According to the
researchers, the story was that the light makes the eye develop abnormally, which then affects focus.
There could be a more obvious explanation, however, which is that myopia is hereditary – myopic parents
are more likely to install night lights in the house, including their children’s bedrooms, but the light itself
has nothing to do with their kids becoming short-sighted. This is, again, a case of upwards bias as myopia in
parents is positively correlated with the presence of night lights in the house and with myopia in their
children.
Lecture 2: R
Why R?
- Free
- Open source; Lots of contributors 🡪 Lots of extensions 🡪 new methods of data analysis
- Integration with other programs
Cons?
- Open source; Lots of contributors 🡪 Many different ways of doing the same thing
- E.g., lots of help functions
R vs. RStudio
- R is the computer doing all the calculations and RStudio is the controls around the engine where
you can see what you are doing
- Need both pieces of software
Create variables e.g., v1 = runif(100) (v1 will be 100 random numbers)
Plot variables e.g., plot (v1, v2)
df = data.frame(v1,v2) 🡪 the variables will be in a table
# this is how you add comments – so NOT code
R markdown = .rmd
Knitting the document: combines everything together 🡪 will formulate the document with all the code put
together
- html document
- can also publish as a webpage
^ library(ggplot2) then can use ggplot command 🡪 much more sophisticated scatter plot
^^ ff is the dataframe (all our variables)
^^^ aes = the basic parameters you are trying to draw (= aesthetic)
^^^^ + geom_point() 🡺 how to make it points (as opposed to like lines or something)
NOTE: independent variable = x-axis (the cause); dependent variable = y-axis (the effect)
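A sketch of the full command, assuming ff is the data frame and x_var / y_var stand in for the column names used in the lecture figure:
library(ggplot2)
ggplot(data = ff, aes(x = x_var, y = y_var)) +  # aes() sets the aesthetics: which column goes on which axis
  geom_point()                                  # draw the data as points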
Zoom 1: R Continued
```{r}
……………………………….
```
^ this is a chunk of code (in an R Markdown document)
Regression
So, we always want to find the true line rather than the estimate (the trendline)
^ we also want to figure out the slope and intercept of this line then we can understand this model
Here: A one percentage point increase in the share of foreigners leads to 0.025 more crimes per capita in
a given year
Note: This is not necessarily a statement of fact as it depends on the precision of the estimate and the
possibility of bias. Rather: it is the implication of our estimate if we took it at face value.
Google Definition: Regression = the statistical processes for estimating the relationships between a
dependent variable and an independent variable
The lm command has tried to put a line in the scatter plot
So trendline describes the econometric model of the relationship between the outcome (Y variable) and
the explanatory X variable
summary(r1) is below:
^ r1 is a list of objects, one of which is coefficients – the parameters that have been
calculated
Beta0 = the intercept
Beta1 = the slope
So now we have defined the model – and fitted it to our data = estimated
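A sketch of how a model object like r1 is created and inspected (the crime-rate and foreigner-share column names here are assumptions):
r1 <- lm(crime_rate ~ b_migr11, data = df)  # fit outcome ~ explanatory variable by OLS
summary(r1)                                 # coefficients, standard errors, R-squared
r1$coefficients                             # just beta0 (intercept) and beta1 (slope)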
OLS Algorithm
R finds the estimates of β0 and β1 by minimising the sum of squared residuals (hence least squares)
β̂0 = Mean(Y) − β̂1 · Mean(X)
It is squared because we care about how close we are to the points 🡪 the bigger the arrows are, the further
we are from the points
- But we don’t care whether we are away on the +ve or -ve side
- So SQUARE
-
- For every observation – we square it, calculate the sum and see how big it is
- Across all the observations in our dataset
- And we would like for it to be not so big
o This means that our line will be not too far away from these points
-
- ^ so, B1 estimate turns out to be, the covariance between X and Y variables, divided by variance of X
- So, if you have 2 variables that are +vely correlated 🡪 +ve covariance 🡪 then you get a +ve slope
- But if -ve covariance 🡪 -ve slope
Say if there was no relationship and B1 was 0 then B0 would just be mean Y – because mean X times 0 is 0
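A sketch of computing the same estimates by hand from these formulas (column names as above are assumptions):
b1 <- cov(df$b_migr11, df$crime_rate) / var(df$b_migr11)  # slope = Cov(X, Y) / Var(X)
b0 <- mean(df$crime_rate) - b1 * mean(df$b_migr11)        # intercept = Mean(Y) - b1 * Mean(X)
# These should match coef(lm(crime_rate ~ b_migr11, data = df))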
Residuals = E
For every observation – there are residuals 🡪 new data frame called residuals ^^^
^ cor = correlation matrix – select picks out the chosen columns, here ‘residuals’ and ‘b_migr11’
You can see that residuals correlated with residuals = 1 (they are the same variable)
The correlation between the residuals and b_migr11 is so small 🡪 it is virtually 0
Why is that interesting?
- The residuals are absolutely unrelated to the X variables
*** Online explanation: “the idea of Simple Linear Regression is finding those parameters α and β for
which the error term is minimized. To be more precise, the model will minimize the squared errors:
indeed, we do not want our positive errors to be compensated by the negative ones, since they are equally
penalizing for our model” ***
OLS on YouTube:
Residuals = Ei (the difference between the observed Y (the data point) and the estimated Y (on the trendline))
- e.g. -ve residual is when dot is below the trendline
((Least squares estimates the unknown values of the parameters B0, B1 in the regression function
Yi = B0 + B1Xi + Ei ))
So if we wanna figure out how given data measures up to the trendline – we wanna minimise ALL the
residuals! – so, a combination of all residuals?
- BUT
- This is tricky as some are +ve (above trendline) and some are -ve (below trendline)
- Hence, we add up the SQUARES of all residuals – this stops the -ve and +ve cancelling out
o Also useful because anything with a larger residual – counts even more heavily once squared
Defining a function:
Loops:
In any programming software – have some form of loops to REPEAT anything 🡪 avoids writing a command
over and over again
- for {} – anything between those brackets is what is going to run over and over
- so first we have created ‘regions’ which is the list that you want to work with
- %>% unique – prevents repeats
- *** remember ‘plotter’ and ‘inner’ need to already be coded into the memory before the above can be
run ***
- so, need to run ALL CHUNKS OF CODE above first
- then the loop will work
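A sketch of such a loop, assuming a data frame df with a region column and a plotting function plotter() defined in an earlier chunk:
library(dplyr)
regions <- df$region %>% unique()  # list of distinct regions, no repeats
for (r in regions) {
  plotter(r)                       # run the same command once per region
}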
(2) R Exercise
To load a dataset with R command?
- Auto <- read.csv("C:/file_path_way/auto.csv", header = TRUE)
- header=TRUE 🡪 column names will be the headers
- to find pathway directory:
o getwd()
Installing packages?
- install.packages("dplyr")
- library(dplyr)
We can see that for area there is no max or min because it is a character string, not a numeric variable
Q1= The middle value that falls between the smallest number in data set and the median
Q3= The middle value that falls between the median and the largest number in the data set.
PIPING?
- Piping allows you to send the result from one R command to the next.
- E.g. auto %>% summary()
- ^ is the same as summary(auto)
Max Command?
covid_july %>% filter(deathsOcases==max(deathsOcases))
^ Now what do we expect to find? To die from COVID you have to get sick and it will take at least a couple
of days. Hence, we would expect to see death rates to be low at first and then to increase.
- Over time we would hope that doctors get better at treating COVID patients.
- Also, people who know they are at risk will increasingly be shielding, so that the only people who get
sick will be those who are less likely to die.
- Both factors should bring the death rate down after an initial peak.
- This is what we are seeing when only looking at the series from the end of March onwards with a
peak in mid May.
- However, there is a first peak in early March. What could explain this?
- The most plausible explanation is probably measurement error: early on, cases of infection were
probably not counted properly as there was no systematic way of testing the population. On the other
hand, if somebody was so ill that they died, this would almost certainly be picked up by the
authorities.
- Hence death figures were not undercounted.
- Also, early on numbers were low.
- So a couple of counts too few for cases could easily swing the ratio
Lecture 3: Visions
Objective: Be able to use diagrams & visualisations to represent data in R – they are the best way of showing
data and telling a story.
Example: Soho Cholera Outbreak
^ most of the deaths concentrated around the pump – good example of visualisation; tells the right data
Scatter Plots
^ so by adding the density of states, we can see that hoax tweets are more closely linked with deaths in
more densely populated states 🡪 but the relationship is not monotone (slope is flatter in top quartile than
2nd and 3rd quartiles)
🡪 all we needed to add to the simple scatter plot command is the color=dens_quart argument as part of
the plot aesthetic
Time Series
🡪 We might ask if hoaxism has died down as the crisis progressed – using day by day data
- Would be quite long; how to shorten?
- Aggregate across states
^ make DATE into graph: time series
^ same thing but can see what happened in each month of 2020 and can do the same etc. with like weeks
etc.
usbyday=usbyday %>% mutate(week=round_date( date, unit = "week"))
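A sketch of the aggregation step that would follow, assuming the data frame has a hoaxsh column for the daily hoax share:
library(dplyr)
usbyweek <- usbyday %>%
  group_by(week) %>%                # one group per week created by round_date() above
  summarise(hoaxsh = mean(hoaxsh))  # average daily hoax share within each week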
< might want to run regression on this data with the number of deaths
to see if there is a relationship between Hoaxism tweets and deaths – but with time series, can overlay the
death time series alongside!
So what is the story here?
It is interesting to see that every major wave of deaths was preceded by a flare-up of hoaxism a couple of
weeks earlier
- (e.g. in February we had a flare up of hoaxism followed by a spike in deaths in April. Then in May
hoaxism was strong again followed by a death spike in late July. Worryingly in early September we
are seeing another spike in hoaxism).
Of course, this has to be taken with caution. A different story could be that the death spikes cause hoaxism
(although the first hoaxism spike could not fully be explained by that). This could be due to a phenomenon
we also sometimes see with religious beliefs:
- if delusional beliefs (e.g. an apocalyptic prophecy not coming true) are challenged by reality (e.g. by a
spike in people actually dying), in some cases the deluded rally even more closely around their
delusional beliefs because the cost of stopping believing has now increased.
- For instance, in the COVID case you now not only have the embarrassment of having believed
something silly but you might have to accept responsibility for behaviour that killed others, maybe
even loved ones.
Bar Charts
Can see if Hoaxism is more intense on some weekdays?
< Ta da
< tidied up and ‘fill’ gives different bars different colours
🡪 seems hoaxers are particularly active on Sundays
Histograms
🡪 Allows us to look at the distribution of a variable across a sample. E.g., we can look at the daily hoax
shares
ggplot(data=usbyday, aes(hoaxsh*100)) + geom_histogram() + ylab("Number of days") + xlab("Hoaxshare in %")
Density
🡪 Often preferable to a histogram – if you multiply the density by the width of a histogram bin you get the share
of observations (as opposed to the count) that falls into a particular category
ggplot(data=usbyday, aes(hoaxsh*100)) +
  geom_histogram(aes(y=..density.., fill=..density..)) +
  ylab("Density") +
  xlab("Hoaxshare in %") + theme_minimal() + geom_density()
Maps
One of best ways to visualise data
^ To convey data (e.g. the share of hoaxism) we can use a heat map
D3
D3 is a powerful JavaScript library for making interactive web-based figures and visualisations. It’s particularly
cool for visualising networks. The R package networkD3 provides a simple interface to make some of the
functionality available in R.
- E.g. we can create flow diagrams (also known as Sankey Networks)
Where more foreigners = more crime 🡪 is the estimated relationship strong enough to confirm that there really is one?
🡪 Could be that there is no relationship at all?
🡪 At some point, it becomes unlikely that there is no relationship e.g., the steeper a slope gets
🡪 Need to determine how certain these estimates are
< R code making our own data; does not look like
there is a relationship between the 2 variables – which makes sense because it was random data
- And the B1 = 0 so again would expect no relationship
BUT, when we run regression on the above data, we get: -0.81 – huh? How? Isn’t that really big?
- Let’s do the whole thing many times:
-
- So, this is what we would expect from no relationship
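A sketch of that repeated simulation: generate data with a true slope of 0 many times, keep each estimated slope, and look at how the estimates are distributed.
set.seed(123)
betas <- replicate(1000, {
  x <- runif(100)
  y <- 2 + 0 * x + rnorm(100)   # true model: beta1 = 0, epsilon is standard normal
  coef(lm(y ~ x))[2]            # keep the estimated slope from this sample
})
hist(betas)                     # centred on 0 and roughly bell-shaped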
If the data was continuous – it would make the histogram into bins, like above bins of 0.5
Density 🡪 re-scaling of the first frequency graph – multiply each bar by a number so the interpretation
changes slightly
- Then the area of the bar matters more
- So density shows % vs. number
- The blue line shows the bell shape of THIS data and graph – smoothed line of THIS histogram
- Orange line shows actual formula – normal distribution
- The more times you do it, you see the blue curve get closer to the orange curve
Now we can see how our estimate is distributed, so anything that is in the tails would be the EXTREME or
the UNLIKELY
^ black line is baseline (the original sample); blue line is what happens with a much smaller dataset
🡪 much flatter curve; so probability of being in the middle has gone down
🡪 probability of being in the tails is much higher; certainty has gone down
🡪 going away from the true value – probability is much higher – is much harder to pick out the TRUE value
🡪 few data points = uncertain true value
Variation in Epsilon
^ changing epsilon: the range is much larger; but blue curve is still flat
What does this show?
- So, the X obviously has an effect on Y but the epsilon also (which has nothing to do with X) has an
effect on Y
- If there is much more stuff going on in the epsilon/residual – it will be more uncertain trying to
estimate Beta and may be further away from TRUE
^ Density of x – really weird curve, not even a curve really; x that varies a lot more
^^ Graph 2: shows when there is a bigger difference with the X values – it is much easier to figure out what
the effect is 🡪 the Beta values are much clearer
E.G., if you look at effect of GDP on suicide? If there are small differences in GDP, may not clearly see effect
on suicide but if you have a really BIG drop in GDP there will be a much more clear, obvious effect on
suicide
🡪 SO, when the X varies a lot more, the Beta becomes clearer
Non-normal Epsilon
^ From those NON-NORMAL EPSILONS, it shows that it doesn’t matter – the blue line is still normally
distributed
- So, no matter what epsilon is, we will end up with a normal distribution
- = central limit theorem
o If you take any random variable and calculate average and sample it various times – you will
end up with something normally distributed
o (but not always the case – as shown on the LEFT; there is a smaller sample size there – just
10 samples)
o Thus, need to make sure the number of observations is big enough
Recap:
- Regression estimates are (approx.) normally distributed
- We can work out the variance
- Normal distribution is fully characterised by standard error and mean
- To work out the likelihood that a particular value arises, we can work out the area
under the density
- We can define how much risk of being wrong we are willing to accept and then work out a critical
threshold (= significance level)
P-Value
T-Statistic
The t statistic is how far away the estimate is from its hypothesised mean (measured in standard errors)
You test the t statistic with the t distribution
^ So even though the data may not be standard normally distributed, by calculating the t-value, we can
compare it to a standard normal critical value
🡪 7.4>1.96 so we can reject the hypothesis
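As a small worked sketch (the estimate and standard error here are placeholders, not the lecture’s actual numbers):
beta_hat <- 0.025                      # estimated slope
se_hat   <- 0.0034                     # its standard error
t_value  <- (beta_hat - 0) / se_hat    # distance from the hypothesised value in standard errors, about 7.4
abs(t_value) > 1.96                    # TRUE, so reject H0 at the 5% level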
More or Less Significant Estimates
• If we have a lower significance level (e.g., 1%) we are less likely to reject a hypothesis
• It is always harder to reject at a lower significance level because we are not willing to accept as much risk of being
wrong
• This is to avoid making the Type I error
^
• If we still reject the β=0 on the basis of an estimate β we say that the estimate is highly significant
• If we would only reject the hypothesis with a much higher significance level (e.g. 10% instead of 5%)
we say that the estimate is only weakly significant
• Have to be cautious – might be wrong
Another Example
< true data is green line
- Simulated epsilon draws BLUE: quite linear – very similar to the true green line; allowing for very
little spread in epsilon (so very little spread in beta)
- Epsilon RED: can come up with almost anything – very large spread in epsilon so very large spread in
beta
Testing Beta = 0 is the most common test but there could be other areas of interest
Here we are having Beta as -1 (because the hypothesis is that when you lose a year of school, you will lose
a year of experience)
- Any test can be expressed with thresholds – so with t-value, we can simply use the formula but for
the p-value, need a command
- library("car")
- linearHypothesis(mod_earn_exp, c("S = -1"))
- ^ S is the coefficient associated with S variable (school)
-
- P-value = very small and different to the p-value in the original outputs ^^^
Estimation of Variance of Beta
- The difference between these ^ t-distributions and the normal is that the middle is a bit lower, but the tail ends are a
bit higher
- So, the likelihood of getting something far away from the mean/true value is higher than with the
normal distribution
o This is because you have to use standard error of BETA in t-value calculation
o Standard error of BETA is already an estimate in itself
o So greater UNCERTAINTY in your estimate
- There are also multiple t-distributions. Why?
- Depends on degrees of freedom = number of observations – number of parameters estimated
- So, in this case: observations – 2
- But sometimes have more parameters and more complex models so need to subtract more from
the degrees of freedom
- ^^^ The higher the degrees of freedom, the closer the distribution is to the normal
o So, having a really high number of observations means you don’t have to worry too much
Critical Values T-distribution
So calculating the critical value ^^^, need the degrees of freedom – here we have used 10 (maybe we had
12 observations and 2 parameters to estimate, beta 0 and beta 1)
- The number is then -2.23
- This number is quite different from 1.96
- The more degrees of freedom you have, the smaller the difference between the 2 numbers get
- e.g., when we did 1000, it was virtually the same number (1.96)
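In R the critical values come from qt(); a sketch matching the numbers above:
qt(0.025, df = 10)     # about -2.23: 5% two-sided critical value with 10 degrees of freedom
qt(0.025, df = 1000)   # about -1.96: with many degrees of freedom it matches the normal value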
QS: >
The t-value (rounded up to 3 decimal places) is?
Can she reject H0 at the 5% significance level?
Can she reject H0 at the 1% significance level?
ANSWERS
-2.857
Yes
Yes
NOTE: I used the rejection threshold values for the normal distribution (1.959964 and 2.5758293 for a 5%
and 1% significance level). For the t distribution the corresponding values are 2.0017175 and 2.663287 (note that
we have 58 degrees of freedom)
We can also work out p-values - Remember that the p-value is the probability - assuming the H0 is correct -
to have a value more extreme (further away from 0) than the one estimated.
- We can use the pt() command for that, which gives us the cumulative density function of the
t-distribution; e.g. pt(0,58) gives us the probability to have a value smaller than 0.
- Because the t-distribution is symmetric that will always be equal to 0.5
- Note that we have 58 degrees of freedom here as we have 60 observations and we need to estimate 2
parameters (β0 and β1).
- To work out the p-value call the pt() function with the t-value for a given estimate; e.g. for part a)
- pt(-.2/0.07,58)
- [1] 0.002962872
- i.e. the probability of having an estimate lower than -.2 is 0.0029629.
- Note, because our test considers the possibility of being too low and too high (and because the
distribution is symmetric) we need to double this to get the actual p-value which becomes
0.0059257.
- However, this is still below 1% so we can safely reject the hypothesis that the true parameter is 0.
Objectives:
- Multivariate regression – further control variables in regression
- Understand that more variables is not always better
- Learn hypothesis tests on several parameters
🡨 Wage regression:
- Can you expect a higher wage after studying for
longer
- But here we capture GENDER too
o But we cannot code qualitative things
e.g., gender, which school
o But we can set it to equal 1
▪ But it doesn’t just depend on the +ve or -ve correlation – but also on whether what is missing
has a +ve or -ve effect on the DEPENDENT variable
▪ Women = -ve relationship with education (so downward bias right?)
▪ So upward bias
▪ Also – the more education you get in your dataset; the less likely you are to find
females in your dataset – so again will be driving up wages
- So how do you get rid of this bias?
-
Red = regression line (too steep)
Green = true line; much flatter (so
upward bias)
(but also slightly to the left; highest educated person is man)
🡪 can also see dashed lines which are the extended model – not really true but there are two groups and
one group has a higher line and the other has a lower line but the intercepts are different – but the SLOPEs
are the SAME (just slightly lower)
🡪 Can read the coefficient of gender as the DIFFERENCE in INTERCEPTS = B3
🡨 not always clear cut; so here one
variable is causing experience and therefore causing wages but there might be other factors going on too
e.g., maybe what is driving how much education you have is actually how much experience you’ve gotten
(motivating you do get more education?)
- Then how do you know whether to make it a control or not???
- Say we are only interested in the effect of education but are worried about the confounding factor
experience – we can run a regression with and without experience and see if education co-efficient
is more or less the same? Then know that it won’t make a difference
- If it does change a lot, then we know we cannot make strong conclusions with the current data and
need more research or other strategies e.g., instrumental variables or if you find other data about
schooling and vocational training
- If the causality between the two explanatory variables goes both ways we are in trouble as far as
finding the causal effect of EDUC is concerned (we are cool for finding the ceteris paribus effect).
Both including or dropping the gender variable will lead to a biased estimate. We have to use other
methods some of which we shall discuss later in the module. INSTRUMENTAL VARIABLES IF THERE
IS REVERSE CAUSALITY.
^ so one more year of education = 1.4 years LESS experience
- 🡨 gone from
0.54 to 0.64
🡨 more female = less education (not super significant
(0.05 p value) but still fairly significant = 0.5 of a year less education on average
Perfect Multi-collinearity
^ once we have more variables – we don’t just worry about X and Y but also causal relationships with extra
control variables
- Perfect multi-collinearity = when the control variables are perfectly correlated with each other
- When one control variable goes up, the other goes up (or down) to the exact same degree
- Simplest way to generate this, is to create a new variable e.g., EDUC_in_days = EDUC*365
o Looks like very different numbers but they are proportional (see cor() = 1)
- 🡨 N/A for
educ_in_days because it carries the same information as EDUC (in years)
- Could be that a student may drop out halfway through a year and Educ in days would be different to
Educ in years – then wouldn’t have perfect correlation
- So R identifies that two variables in dataset have a perfect linear relationship so they will drop one
of the variables
- (the coefficient would be different to EDUC in years but what it represents would be the SAME
because now the unit doesn’t represent the change in one year, it is the change in one day – same
thing as saying salary in dollars vs pounds – number will change but not actually earning any more)
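A sketch of the whole demonstration, assuming the wage1 data with wage and educ columns:
wage1$educ_in_days <- wage1$educ * 365        # perfectly proportional to educ
cor(wage1$educ, wage1$educ_in_days)           # exactly 1
lm(wage ~ educ + educ_in_days, data = wage1)  # R drops one of the two and reports NA for it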
R² = 100%
- Can make R² artificially high
Finding R²
• Accounting is not necessarily explaining
• R² increases mechanically as we add further variables
• If we have as many parameters as observations R² is always 100% (e.g. consider 2 observations)
• Hence Adjusted R² = 1 − (1 − R²)(n − 1) / (n − (k + 1)), where k = number of variables
• i.e. the higher k, the lower the adjusted R²
• ALWAYS REPORTED IN REGRESSION
• So if you include as many variables as you have observations 🡪 the denominator n − (k + 1) becomes very
SMALL 🡪 so the ratio becomes BIG 🡪 adjusted R² is pushed down
• ^^^ here the adjustment is quite small because >500 observations and only 3 variables
• High R² 🡪 (not as important for causality) useful for predictions
Imperfect Multi-Collinearity
🡪 Explanatory variables are closely but not perfectly correlated – a CLOSE relationship but not exactly PERFECT
🡪 Consequences:
• We can estimate all coefficients – so will still get output in R (instead of NA)
• Variance of estimates might be high i.e. estimates could be quite far off from true value.
• However: estimates will be unbiased (if x not correlated with ϵ)
🡪 Can use the VIF (variance inflation factor) command to see how much the variance is inflated
🡪 May also see lots of insignificant effects (but you know that JOINTLY they matter)
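A sketch of the VIF check on a fitted model mod (a placeholder name), using the car package:
library(car)
vif(mod)   # one value per explanatory variable; large values flag strong collinearity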
🡨 can see he has grouped the ages
together (because separately regression showed that they were insignificant) 🡪 but TOGETHER = significant
2) Education & No. of jobs balanced across race? Do regression to be sure?? YES
🡨 p-values show no significance
3) What do you make of these results (mean and SD are also the same)?
- If there was any evidence of a systematic relationship between race and any of those characteristics
we could potentially be in trouble when simply comparing interview call backs for different race
groups. Any differences found could simply be due to those other factors rather than racial bias by
employers.
- In the data, there seems to be a clear difference between the races. Whereas for white people call
back rates were above average (9.65%), they were below average for black people (6.45%),
suggesting a racial bias by employers.
4) Linear regression = predict the continuous dependent variable using a given set of independent variables
(mod <- lm(call~black,data))
Logit regression = predict the categorical dependent variable using a given set of independent variables
(mod2 <- glm(call~black,data,family=binomial))
5) Are black people significantly less likely to be employed? And what is the % difference?
6) On the basis of your evidence what can you conclude about racial discrimination in the US labor market?
Do we see an open-and-shut case of racial bias here? Yes No
Think of the potential caveats and alternative explanations. What analysis could you undertake to address
some of these caveats?
- We see that people with black background tend to have less college education and college
education is another major driver of being employed
- Hence, far from implying a racial issue, the result in (b) could simply reflect employer’s preference
for more highly educated workers. We can examine this by doing the analysis in (b) separately for
workers with different educational attainment
- The results suggest that for either group there is a significant racial gap when it comes to being
employed. Note that the effect is considerably stronger for less educated workers.
o Hence this reinforces the hypothesis that there is discrimination against workers with black
background which is un-related to their productivity in the workplace.
- However, there might be further caveats: our simple regression cannot account for the quality of
the college education which can vary considerably and might vary systematically along racial lines.
- Furthermore, an important driver for a good education and for various other skills might have to do
with parental income and status.
o Again, this is likely to vary systematically along racial lines.
- While it is interesting to ask if employers discriminate above and beyond what could be expected on
the basis of education and skill of workers – which is what we were implicitly doing above - we
might also be concerned about the overall impact of racial background on labour market outcomes
which includes initially different educational outcomes.
- Hence, depending on our interest we might be primarily focused on the effect of race holding
education fixed or we might be focused on the overall effect.
🡨 Intercept is 7.1
^ Can see that when women = 0 (so a man) – it is 7.1 (intercept)
But when you add 1 (woman) 🡪 it becomes 4.59 (-2.51)
🡪 Conditional Expectation = another way of expressing it; so the average wage for women (conditional on
something else)
- So if someone is a woman, you would expect a wage of 4.59
- SO the average WAGE of woman = MAN’s WAGE + WOMAN coefficient
- So 7.1 + (-)2.51 = 4.59
Dummies as bars
🡨 An example of perfect multi-collinearity (because male
and female are perfectly NEGATIVELY correlated; remember male = 0 and female = 1)
🡪 Have dropped one of the variables (female) because of the collinearity
🡪
🡪 For males – need to add the male coefficient to the constant – the average for females + the male coefficient (the
STEP UP)
• Which dummies we include exactly will affect the interpretation of the coefficients (the β's)
• If we include “constant” and “male” (“female”) then “female” (“male”) becomes the reference
category
• The mean of the reference category is represented by the constant coefficient
• So far, categorisation into just 2 categories
• But what about more? E.g., countries/nationalities
Sets of dummies
More classification e.g., levels of schooling
🡪 Create new variable:
🡨 Create educats
🡪 educats = 1 when years == 12
🡪 educats = 2 when years >12
🡪 therefore educats =0 when <12 years
< so 116 with <12yrs; 198 with 12 yrs; 212 with >12yrs
🡪 Here there is a clear ordinal representation 🡪 more years = better
🡪 How to run a regression?
🡪 lm command
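A sketch of creating educats and running the regression, assuming the wage1 data with an educ column (years of education):
library(dplyr)
wage1 <- wage1 %>% mutate(educats = case_when(educ < 12 ~ 0,
                                              educ == 12 ~ 1,
                                              educ > 12 ~ 2))
table(wage1$educats)                       # how many observations in each category
lm(wage ~ factor(educats), data = wage1)   # factor() makes R create the dummies; category 0 is the reference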
🡨 so create 3 dummy variables here
🡪 so variables that are either 0 or 1 e.g., if educats = that number, it is 1; if not, it is 0
Key ADVANTAGE of using DUMMY VARIABLES 🡪 CAN see the DIFFERENT STEP UP vs. STEP DOWN
(because if it was linear – would be the same amount going up and down e.g., a certain number each time)
🡪 But with dummy variables – can see exactly how much you need to go up and down
🡪 The increment is not always the same
🡨 reference here is edunormal
🡪 so testing that eduLOW is EQUAL to -VE eduHIGH
🡪 That you can go up and down the same amount
🡪 LOOKING AT COEFFICIENTS: if eduLOW was the reference point
- Would have to step UP by 1.3 to edunormal
- And step UP by 3.3 to eduHIGH
🡪 So estimating LINEAR:
R – how to create more categories?
^ the second one (with 0) gets rid of the constant and instead of seeing the STEP UP, you just see the
average of each category.
^ now have a dummy for normal; high and a dummy for female too!
^ so when we just have the constant = not female + low education, i.e. a low-educated male; everything
else is compared to that reference group
So to figure out e.g., the wage for a female – it will be the constant + female
To figure out normally educated female – it will be constant + female + normal
🡨 CVs with “black” sounding names have a 3.2%
lower chance of receiving a call back
^ as year of experience goes down – wage goes down by 0.3% points – buttt if you look at each year
individually; can see different effects sooo could use dummy variables
Non-Linear Relationships
• Relationship between explanatory and dependent variables may be non-linear
• There are general methods to deal with this
• However, in many cases we can avoid using different methods because many types of seemingly
non-linear relationship can be represented in what boils down to a linear regression
• CAN EXPRESS AS A LINEAR MODEL AND USE LINEAR REGRESSION
• e.g. suppose you suspect that the relationship between wage and education in wage1.dta is actually
following a quadratic form:
𝑊𝑎𝑔𝑒 = β0 + β1·𝐸𝐷𝑈 + β2·𝐸𝐷𝑈² + ϵ
Square Relationship
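A minimal sketch of estimating this quadratic in R (assuming wage1 with columns wage and educ):

# I(educ^2) lets lm() treat the square as just another regressor,
# so the model stays linear in the β's
m_sq <- lm(wage ~ educ + I(educ^2), data = wage1)
summary(m_sq)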
Log-Linear Relationships
• The most popular non-linear model is probably the EXPONENTIAL one:
• 𝑌 = exp(β1 + β2𝑋2 + … + β𝑘𝑋𝑘 + ϵ)
• To make it linear all that is required is to take the (natural) logarithm on both sides of the equation:
• ln(𝑌) = β1 + β2𝑋2 + … + β𝑘𝑋𝑘 + ϵ
• (linear expression of the above)
• One of the reasons why it’s popular is the interpretation of the β coefficients it implies
🡨 β = the (approximate) proportional change in Y for a one-unit change in X
^^^ so with the LOG on the left-hand side, when we change X by 1 unit, Y changes by roughly (100·β)% – the change in lnY is the GROWTH RATE of Y
*** ln(1 + z) is approximately equal to z when z is small 🡪 THIS IS WHAT THE GRAPH IS SHOWING – the two lines are basically the same ***
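A rough sketch of a log-linear wage regression in R (column names educ and exper from wage1 are assumed):

# coefficients are now (approximately) proportional changes in wage:
# one more year of education raises wage by about 100*β %
m_log <- lm(log(wage) ~ educ + exper, data = wage1)
summary(m_log)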
Log-log Model
Internet: This model is handy when the relationship is nonlinear in parameters, because the log
transformation generates the desired linearity in parameters (you may recall that linearity in parameters is
one of the OLS assumptions)
^ a simple way to describe the relationship between the output/value added of a firm or economy (Y) and its production factors (labour/employment and capital); it shows how much the value changes when you change the production factors, e.g., if you add more workers 🡪 taking the log of the function turns it into a linear function, so we can run a regression (output would go up by αL %)
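A sketch of the corresponding log-log regression in R (the data frame firms and its columns output, labour, capital are hypothetical; a Cobb–Douglas-style form Y = A·L^(αL)·K^(αK) is assumed):

# taking logs: ln(Y) = ln(A) + αL*ln(L) + αK*ln(K) + ε
# the coefficients are elasticities: a 1% increase in labour raises output by about αL %
m_cd <- lm(log(output) ~ log(labour) + log(capital), data = firms)
summary(m_cd)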
Summary
• Don’t fall in the dummy variable trap
• The same model can be represented in several ways
• Be careful with interpretation of dummies
• A lot of stuff that looks non-linear at first glance is linear after all
Extra: Interactions
With an interaction term, the effect for e.g. a normally educated woman becomes: β_female + β_(female × normal)
CAN DO THIS IN R:
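A rough sketch (dummy names illustrative):

# female:eduNormal and female:eduHigh are the interaction terms;
# the female effect is then allowed to differ across education groups
m_int <- lm(wage ~ female + eduNormal + eduHigh +
              female:eduNormal + female:eduHigh, data = wage1)
summary(m_int)

# equivalently: lm(wage ~ female * (eduNormal + eduHigh), data = wage1)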
🡨 so low is the reference; can see that normal means a higher salary on average and high means an even higher salary on average
^ can also see that there is a female effect, and it is even bigger for the normal group
- E.g., in the normal group it is almost 1.7 lower (the female coefficient of about -1.3 plus the interaction term)
- But the interaction is not significant, so we could interpret it as saying: let's go back to our original model, because the difference across education groups is not that significant
^ so the model expresses that the experience effect goes up in a linear way (hence why you multiply by exper)
(exercise 6 to complete)
^ in any (linear) model we want to estimate BETA – how Y is affected by X – but there are other factors (the error ϵ) affecting Y, which we hope are independent of X
🡪 Part of the problem is that there are probably MULTIPLE factors (e.g., for wages and education: liking to study, living in a richer country etc.) that are driving X as well as the error
Instrumental variables 🡪 if you can identify at least one factor driving X that is independent of epsilon, you can use it to potentially find an UNBIASED or CONSISTENT estimate of BETA
2 Stage Least Squares Estimator (2 SLS)
^ so if you find such a variable, Z, (Pi can be replaced with any Greek letters)
- Figure out the relationship between X and Z
- Find the estimated relationship
- Then find the relationship between Y and Xhat
- The only thing that can move the Xhat up or down is the Z (if Z goes UP or DOWN)
- So Xhat will NOT be correlated with EPSILON (because Z is not correlated with EPSILON)
🡪 Academic talent: we would expect an overestimated effect because of this factor (it has a +ve effect on both education and epsilon)
🡪 Super nerd: studying really hard may also have an effect – it would have a negative effect on epsilon because you don't necessarily make loads of money
🡪 Another factor affecting education is the cost of attending – which is affected by closeness to a college, e.g., international vs. local student (and has nothing to do with talent, nerdiness or any other factor affecting wage)
- This factor is attractive because we can also find data on it
- So use the DISTANCE instrument
1ST STAGE
^ The F statistic of instruments in 1st stage should be LARGER THAN 10 🡪 here it is 88
2ND STAGE
🡪 Take the estimates from the first stage – PREDICT the X VALUES 🡪 THEN USE THE PREDICTED X VALUES IN A NORMAL REGRESSION
🡪 ivreg() command (see the sketch below)
🡪 If there is not much change, then you know that it was a good story but not a good instrument
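A rough sketch of 2SLS in R with the AER package (the data frame college and the columns wage, educ, distance are illustrative, following the distance-to-college story):

library(AER)

# formula syntax: outcome ~ endogenous regressor | instrument(s)
iv_m <- ivreg(log(wage) ~ educ | distance, data = college)
summary(iv_m, diagnostics = TRUE)   # diagnostics include the weak-instrument (first-stage F) test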
A graphical representation of IV
3 Key Criteria
1. The problem could be that X and Epsilon are RELATED (that X is causing Epsilon, or vice versa)
- So we need to make sure the instrument is INDEPENDENT of Epsilon (Epsilon can be driving X but should NOT be driving Z)
- We can only argue for this criterion (it cannot be tested directly)
2. It is easy to find things that have nothing to do with EPSILON, but Z must also be driving X, and driving it quite STRONGLY
- This one can be checked outright with the IV command (look separately at the FIRST STAGE to see if Z is a super significant DRIVER OF X)
3. Epsilon mustn't be driving Z, but Z also mustn't be driving EPSILON
- The only way Z should affect Y is through X (the exclusion restriction)
^ How can these criteria be violated?
- Lots of colleges in London 🡪 London obviously higher salaries
- Big city colleges 🡪 may draw more high flyers
- May have to include control variables
Control Variables
^ Then instead of requiring Z to be uncorrelated with EVERYTHING in the error, it only needs to be uncorrelated with the error that remains AFTER including the CONTROL variables
^ added regional control variables
^^ the estimate changed from 114 to 98, which tells us that there is something in this story
🡪 Now let's check that the instrument is still a really SIGNIFICANT driver of X even after including all the control variables (the strong-first-stage criterion)
🡪 first-stage F statistic is still > 10
NOTE: reduced form = regressing the outcome directly on the instrument (seeing how they are actually related; e.g., the closeness effect shouldn't show up for people who didn't actually go to college)
Weak instrument problem
🡪 The IV estimator is (roughly) β_IV = Cov(Z, Y) / Cov(Z, X), so we need a strong relationship between Z and X in the denominator (so that it is a large number)
🡪 Because a small denominator blows up the ratio – any small correlation between Z and Epsilon gets inflated and the estimate will be BIASED
🡪 Can avoid this if you have a strong first stage – a STRONG INSTRUMENT
- We want the first-stage F statistic to be LARGE (rule of thumb: above 10)
- So family size doesn't seem to affect the outcome (which is actually a good finding)
Multiple Instruments
^ e.g., Closeness of college + which college + how much college + type of education etc.
🡪 So, need good causal estimates of all of them
🡪 So, if there are two endogenous X variables we need at least 2 INSTRUMENTS
- Otherwise you wouldn't know which X variable is being affected by the instrument
Summary
• Endogeneity is often a problem: X is correlated with ϵ
• However, X is also driven by other factors
• If we can find data on at least one other factor Z which is independent of ϵ we can do 2SLS IV
• Can combine with using various other controls to make it more plausible that remaining error ϵ is
indeed independent of Z
• Need to ensure strong first stage
• Finding IVs is a bit of an art
(exercise 7 to complete)
Time Series data: Different data points represent different points in time 🡪 This introduces some additional
challenges
^ red line = covid, and there is a sharp increase in 2019 – let's test it with a regression
🡪 the trend is 5% and is significant!
🡪 NOTE: we use the log of the economic activity index – because with log(Y) the coefficient gives the change in Y in percentage terms (usually a unit increase in X gives a unit change in Y, but here it gives a % change in Y)
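A rough sketch of a log time-trend regression in R (the data frame activity and its columns index and t are hypothetical):

# t is a simple time counter (1, 2, 3, ...); with log(index) on the left,
# the coefficient on t is approximately the % growth per period
m_trend <- lm(log(index) ~ t, data = activity)
summary(m_trend)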
What if time is not linear?
So including a time trend shows whether something grows/shrinks continuously, but what about a one-off event that moves the variables in one direction, e.g., a recession or a pandemic?
Panel Data = time series data AND cross-sectional data TOGETHER, e.g., time series for multiple countries
^ the data above shows weekly data for US states, so several observations per state
🡪 Can introduce a dummy that captures each week (see the sketch below)
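A sketch of a panel regression with week dummies in R (the data frame panel and columns outcome, treatment, week, state are hypothetical):

# factor(week) creates one dummy per week, soaking up shocks common to all states in that week;
# factor(state) adds state fixed effects
m_panel <- lm(outcome ~ treatment + factor(week) + factor(state), data = panel)
summary(m_panel)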
Autoregression
• Not only could time be a confounding factor but also an assumption that we are making implicitly is
that shocks are not related from one observation to the next
• A particular concern in time series is the possibility that observations are correlated over time
• Simplest way to model this is via an Auto regression:
• 𝑌_t = β0 + β1·𝑌_(t−1) + ϵ_t
• 𝑌_(t−1) becomes the X variable (this is basically the dependent variable in the period before, e.g., the week before)
• “the value of today depends on the value of yesterday and some randomness we can’t
predict”
• So include the past as a specific variable
• We can do normal OLS as long as − 1 < β1 < 1
• With β = 1 we have non-stationarity because of path dependence
• The series can wander off into any direction and never come back
• If that happens OLS is no longer un-biased (different observations are too related to each other)
• Also: if you are interested in 𝑌 = β𝑋 and both Y and X have unit roots you will have a spurious
correlation (the unit root becomes the confounder)
• Random Walk
• Of course we don’t know if this is the case in our data before we start any analysis
• INTERNET: WHAT IS A UNIT ROOT?
• = a stochastic (random) trend in a time series
• A unit root (when β1 = 1) tells us whether the time series will recover to its expected value; if not, it is susceptible to shocks and hard to predict and control
Internet: Stationarity
🡪 = an important characteristic of time series
- Time series said to be stationary if statistical properties do not change over time
o So constant mean and variance
- There is a statistical test that we can run to determine if series is stationary or not
o = DICKEY FULLER TEST
o Tests the null hypothesis that a unit root is present
o If we cannot reject the null, a unit root may be present and the process is not stationary
o If the null hypothesis is rejected, the process is stationary
Dickey-Fuller Test
Using R
library(urca) 🡪 ur.df() command (used to test a series for a unit root)
⇨ ur = unit root, df = Dickey-Fuller
⇨ The Dickey-Fuller test provides us with its own critical values to compare the test statistic to
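A rough sketch of running the test in R (the series name y is hypothetical):

library(urca)

# "drift" includes an intercept; "trend" would also include a time trend
df_test <- ur.df(y, type = "drift", lags = 1)
summary(df_test)   # compare the test statistic to the Dickey-Fuller critical values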
So, time series are used a lot in central banks, e.g., to predict growth in the next week, quarter or year, BUT we need to check for autoregression, i.e. whether the series depends too much on its own history.
^ e.g., blue line may look like an upward trend but it’s actually a unit root
We need to difference the series to get rid of the unit root – Another EXAMPLE:
🡨 After differencing
So with time series we need to worry about spurious effects due to time trends or unit roots – so before you run a regression on a series, make sure the series is stationary.
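A short sketch of differencing and re-testing (continuing the hypothetical series y):

# first differences: Δy_t = y_t - y_(t-1); differencing removes a unit root
dy <- diff(y)
summary(ur.df(dy, type = "drift", lags = 1))   # re-run the Dickey-Fuller test on the differenced series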
Summary
⇨ Time series can be easy
⇨ If the series clearly grows or shrinks continuously definitely include a time trend
⇨ However, even if it doesn’t grow (or shrink) the series might contain a unit root
⇨ Use the Dickey Fuller Test to make sure you are dealing with a stationary series
Examples:
- Predicting whether a picture is of a cat or a dog, to prevent spam on a social network for cat owners
- Predicting if a mushroom is toxic or not based on a picture
- Predicting if a person is a republican or a democrat based on demographics
or...
* A big necessary assumption here is that new data is similar to the training data: the model learns from the data, so if the data is not representative the model will not work well, or may even be completely wrong
Survived - SEX
Survived – Age + Class + Sex
Beyond accuracy
Evaluating how often and in what way my model is wrong
Confusion Matrix
In R:
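A minimal sketch of a confusion matrix in R (predicted and actual are hypothetical 0/1 vectors, e.g. predicted vs. actual survival):

# rows = predictions, columns = truth; the off-diagonal cells are the errors
# (false positives and false negatives)
conf_mat <- table(Predicted = predicted, Actual = actual)
conf_mat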
- The underfitted model doesn't match the data, but the overfitted one is WAY too complicated
- Think about adding one new point to the graph: how well will the model perform?
- The more complex the model is, the more likely it is to overfit. Overfitting is a huge issue as the
model will seem to perform great when training it and then perform poorly when applied.
🡨 so the MORE variables you add, the more you IMPROVE the in-sample fit of the model to the dataset! The same goes for R^2 – it never falls when you add variables
How to do this?
^ split the data into a training set and a test set – do it randomly (see the sketch below)
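A rough sketch of a random train/test split in R (the data frame df is hypothetical; the 80/20 split is an assumption):

set.seed(42)                                          # for reproducibility
train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]                              # fit the model on this part
test  <- df[-train_idx, ]                             # evaluate it on this held-out part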
Decision Tree
🡨 you can stop splitting when you can no longer find an improvement in GINI of more than 0.01
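A rough sketch using the rpart package (the Titanic-style data frame titanic and its columns are hypothetical; cp = 0.01 is rpart's default complexity parameter, playing the role of the minimum improvement mentioned above):

library(rpart)

# grow a classification tree; a split is only kept if it improves the fit by at least cp
tree <- rpart(Survived ~ Age + Class + Sex, data = titanic,
              method = "class", control = rpart.control(cp = 0.01))
print(tree)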
Lecture 10: Loose Ends