Pse2023
Federico Crudu
University of Siena
AA 2023-24
What is this course about?
1 Introduction to programming in R
2 Introduction to the tidyverse
3 Applications from economics and econometrics: optimization, Monte
Carlo simulation and the bootstrap, causality, estimation of structural
models, networks, growth models.
Introducing R
We will introduce R, its basic features and we will see how to apply it to
simple data problems first.
R is many things:
1 an effective data handling and storage facility
2 a suite of operators for calculations on arrays (vectors, matrices)
3 integrated collection of tools for data analysis
4 graphical facilities for data analysis
5 a programming language
RStudio allows you to run Python code and also Julia, although not as smoothly. Alternatives include Jupyter, Atom, and VS Code.
Help!!!
One of the most important and most frequently used functions is the help function. Suppose we want to know what the function solve does:
help(solve)
Alternatively
?solve
help("if")
??matrix
Probably the best way to get help is to use Google and/or AI chats.
Commands
The R language is case sensitive: vector and Vector are two different things.
1+1 # expression
[1] 2
a<-1+1 # assignment
a
[1] 2
Notice from the previous slide that if you want to comment a line of code you can use the symbol #.
# this is nothing
1+1; a<-1+1
[1] 2
Source files and diverting output
source("ABunchOfCommands.R")
If you want to divert output to a file, you can use the following.
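A minimal sketch (the slide's own code is not reproduced here; the file name is hypothetical):
sink("MyRecord.txt")  # divert console output to this file
1+1
sink()                # stop diverting, back to the console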
Data permanency and removing objects
Objects are the entities that R manipulates and they may be of various
nature (variables, vectors, character strings, functions,...). To see which
objects are present in an R session we type
objects()
Vectors and assignment
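The assignment referred to below is not shown on this slide; consistent with the output further on, it is presumably the classic example
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)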
Here we used an assignment, the symbol <-, and the function c(). Notice that sometimes the symbol = is used for assignment; I would discourage that.
y <- c(x, 0, x)
y
[1] 10.4 5.6 3.1 6.4 21.7 0.0 10.4 5.6 3.1 6.4
[11] 21.7
Can you use the assignment symbol in reverse order, i.e. ->?
Vector arithmetic
v <- 2*x + y + 1
v; x; y
[1] 32.2 17.8 10.3 20.2 66.1 21.8 22.6 12.8 16.9 50.8
[11] 43.5
[1] 10.4 5.6 3.1 6.4 21.7
[1] 10.4 5.6 3.1 6.4 21.7 0.0 10.4 5.6 3.1 6.4
[11] 21.7
We can perform simple operations that are useful for statistical work
mean(x); var(x)
[1] 9.44
[1] 53.9
sum(x)/length(x); sum((x-mean(x))^2)/(length(x)-1)
[1] 9.44
[1] 53.9
Problem
Sometimes you need to sort your data in a given order. You can do this in
a number of ways. Try to see ?sort and ?order.
Generating regular sequences
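The commands that produce the output below are not shown on the slide; they are presumably along these lines:
1:30
2*(1:15)
2:15
seq(2, 30, by = 2)
seq(2, 30, length.out = 15)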
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[18] 18 19 20 21 22 23 24 25 26 27 28 29 30
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
[1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
Problem
Define n <- 10. Try to see what happens if you write 1:n-1 and
1:(n-1). Think also of a way of building a backward sequence.
The function seq is more general than the colon operator. It takes five
arguments (whose order is irrelevant if you name them).
[1] 1 2 3 4 5 6 7 8 9 10
[1] 1 2 3 4 5 6 7 8 9 10
seq(10,1); seq(1,10,by=.5)
[1] 10 9 8 7 6 5 4 3 2 1
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
[11] 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0
You may check other uses of seq and the function rep
?seq; ?rep
Logical vectors
A logical vector takes values TRUE, FALSE and NA for "not available".
The values in temp tell us whether the logical condition is met. Notice that the condition is applied elementwise.
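A minimal sketch producing such a vector (using the x defined earlier):
temp <- x > 13
temp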
The comparison operators are <, <=, >, >=, == and !=, with the obvious meanings.
If you have two logical expressions, say c1 and c2, you can combine them with the Boolean operators: & stands for and, | stands for or.
Missing values
Missing values are ubiquitous in data sets. In R they are denoted with NA.
If we want to inspect an object to see whether its elements are missing, we may use is.na
z<-c(1:3,NA); ind<-is.na(z);ind
You may ask for the position of the NA values and remove them too
ind2<-which(is.na(z))#;ind2
z[-ind2]
[1] 1 2 3
0/0
[1] NaN
Problem
Notice
a<-Inf-Inf
if(is.nan(a)==T){print("we do not have number here!")}
Character vectors
Character vectors are used very often, in plot labels for example. We write them as text between quotes: "x-values" or "New iteration results". Let us see an interesting example
c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")
Problem
Try changing the terms of the problem in the generation of labs. Check
what happens if you add a further variable "Z".
Modifying vectors
We will explore now some ways to subset a vector. Suppose we want to get
rid of NAs
y <- x[!is.na(x)];y
length(x)>length(y)
[1] FALSE
There are many other ways to do what we saw in the previous slide. You
may want to explore the following situations
Problem
What about the following example? What do you observe?
Have you noticed the role played by [] in the examples seen so far?
Other objects
Despite the prominence of vectors there are other objects that are
extremely useful
1 matrices (or arrays)
2 factors to handle categorical data
3 lists, a generalization of the concept of vector that can contain objects
of not necessarily the same type or size, often used to store the output
of computations
4 data frames, like matrices but can store numerical and categorical
variables
5 functions, this is a story apart.
Objects: modes and attributes
The things or entities that live in R are called objects. Such objects have
special features.
For example, vector objects may contain numeric, logical or string values.
If the components are all of the same type we refer to that as an atomic
structure.
a1<-c(1,2,3);mode(a1)
[1] "numeric"
a2<-c(1,2,"yeah");mode(a2)
[1] "character"
a3<-c(1,2,NA);mode(a3)
[1] "numeric"
Problem (typeof)
Replicate the above examples using typeof.
z <- 0:9
digits <- as.character(z)
d <- as.integer(digits)
z;digits;d # accidentally z and d are the same
[1] 0 1 2 3 4 5 6 7 8 9
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
[1] 0 1 2 3 4 5 6 7 8 9
Changing the length of an object
e <- numeric(0)
e[3] <- 17
e
[1] NA NA 17
The class of an object
[1] "integer"
[1] "logical"
a3<-matrix(rnorm(10),5,2);class(a3)
Ordered and unordered factors
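The vector used in this example is not reproduced on the slide; in the standard example from An Introduction to R it is
state <- c("tas", "sa",  "qld", "nsw", "nsw", "nt",  "wa",  "wa",
           "qld", "vic", "nsw", "vic", "qld", "qld", "sa",  "tas",
           "sa",  "nt",  "wa",  "vic", "qld", "nsw", "nsw", "wa",
           "sa",  "act", "nsw", "vic", "vic", "act")
statef <- factor(state)  # encode the strings as a factor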
Let's say that this is a vector recording the home states of 30 tax accountants from all over Australia.
Also
unique(state)
Suppose we have the incomes of the tax accountants in another vector and
we want to make some operation.
incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42,
56, 61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
59, 46, 58, 43)
We want to know the mean per state and we use the function tapply
# check
?tapply
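A sketch of the tapply call (assuming the state and incomes vectors above):
incmeans <- tapply(incomes, state, mean)
incmeans  # mean income per state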
uni <- unique(state)                 # state labels (assumed; not shown on the slide)
n.per.state <- numeric(length(uni))  # container for the counts
for(i in 1:length(uni)){
  n.per.state[i] <- sum(1*(state==uni[i]))
}
names(n.per.state) <- uni
n.per.state # accountants per state
This is the number of accountants per state. Since the number of observations differs across states, we say that our array is irregular, or ragged. With tapply we applied the mean to each factor level as if the levels were different vectors.
Problem
Devise a function that computes the t test and replicate the analysis we
have carried out in the previous slide.
Arrays and matrices
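The array a used below is not defined on the slide; a typical definition would be
a <- array(1:24, dim = c(3, 4, 2))  # a 3 x 4 x 2 array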
a[2,,]
a[,,]
Problem
Consider the following array
How can we extract the elements x[1,3], x[2,2], x[3,1] and replace
them with zeros?
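The array in the problem and the index matrix printed below are not reproduced on the slide; in the standard example they are
x <- array(1:20, dim = c(4, 5))         # a 4 x 5 array
i <- array(c(1:3, 3:1), dim = c(3, 2))  # index matrix: rows (1,3), (2,2), (3,1)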
[,1] [,2]
[1,] 1 3
[2,] 2 2
[3,] 3 1
Notice that
x[i]
[1] 9 6 3
x[i]<-0
x
?array
An important operation for arrays is the outer product. Given two arrays
the outer product is
ab <- a %o% b
# alternatively
ab <- outer(a, b, "*")
If a and b are two numeric arrays, their outer product is an array whose
dimension vector is obtained by concatenating their two dimension vectors,
and whose data vector is got by forming all possible products of elements
of the data vector of a with those of b.
a<-matrix(rnorm(6),3,2)
b<-c(1,2)
ab <- a %o% b;dim(ab)
[1] 3 2 2
ab
, , 1
[,1] [,2]
[1,] 0.036 2.512
[2,] -0.774 -0.681
[3,] -0.279 -0.814
, , 2
[,1] [,2]
[1,] 0.0721 5.02
[2,] -1.5473 -1.36
[3,] -0.5571 -1.63
We can perform matrix operations similarly to the scalar case, given the
caveats that come from conformability and lack of commutativity for
products.
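A minimal sketch of the usual operations (hypothetical matrices):
A <- matrix(rnorm(4), 2, 2); B <- matrix(rnorm(4), 2, 2)
A + B      # elementwise sum
A * B      # elementwise product
A %*% B    # matrix product
t(A)       # transpose
solve(A)   # inverse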
Notice that the operation t(X)%*%y can be more efficiently performed with
crossprod(X,y).
Problem (Ik )
Let k be a number. What happens if you compute diag(k)?
Lists and data frames
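The list used in the following examples is not shown on the slide; consistent with the output, it is presumably the classic
Lst <- list(name = "Fred", wife = "Mary", no.children = 3,
            child.ages = c(4, 7, 9))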
Notice that components are always numbered and we can extract them
using the corresponding number value
Lst[[1]];Lst[[2]];Lst[[3]];Lst[[4]]
[1] "Fred"
[1] "Mary"
[1] 3
[1] 4 7 9
Lst[[1]][1];Lst[[2]][2];Lst[[3]][2];Lst[[4]][2]
[1] "Fred"
[1] NA
[1] NA
[1] 7
We can also access the components using their corresponding name via the
operator $
Lst$name;Lst$wife;Lst$child.ages[1]
[1] "Fred"
[1] "Mary"
[1] 4
Lst[[1]]
[1] "Fred"
Lst[1]
$name
[1] "Fred"
The former is the first object in the Lst, and if it is a named list the name
is not included. The latter is a sublist of the list Lst consisting of the first
entry only. If it is a named list, the names are transferred to the sublist.
If the restrictions above are met (essentially, the components must be vectors, factors, matrices, lists or other data frames, and the vector components must all have the same length) we may use those objects to construct a data.frame
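A minimal sketch using the tax accountants example (the slide's exact components may differ):
accountants <- data.frame(home = statef, loot = incomes)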
head(accountants)
There are many ways to read a data frame. We will revise some basic
approaches.
When reading a data frame from an external file you expect it to have a
special form
the first line of the file should have a name for each variable in the
data frame
each additional line of the file has as its first item a row label and the
values for each variable.
If you have a data set, say, houses.data, you can import it using
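a call along these lines (a sketch; the object name is hypothetical):
houses <- read.table("houses.data", header = TRUE)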
provided that the path is the right one. It is good practice to check the documentation for examples that work:
?read.table
R contains packages that contain data sets. In this case you can use the
function data
#install.packages("datasets")
library(datasets)
data(infert)
#head(infert)
Problem
Could you get data from a URL?
More modern ways to import data sets from outside sources are provided by the packages
library(foreign)
and in particular by
library(haven)
Probability distributions
rnorm(1)
[1] -0.858
[1] 7.06
The prefixes d, p and q indicate the density (pdf), the distribution function (cdf) and the quantile function, respectively
[1] 0.0303
[1] 9.55
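As a hedged illustration of the three prefixes (not the code from the slide):
dnorm(0)                          # density (pdf) at 0
pnorm(1.96)                       # cdf at 1.96
qnorm(0.975)                      # 97.5% quantile
pnorm(1.96, lower.tail = FALSE)   # upper tail probability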
Problem
What does lower.tail = FALSE mean in the above example?
[Figure: histogram of x on the density scale]
[Figure: histogram of x on the density scale, as on the previous slide]
Problem
Repeat the above example by first increasing the sample size and then
using asymmetric distributions. Find a way to incorporate the true
probability density in your graph.
Grouping, loops and conditional execution
In fact
b.hat
[1] 3.97
Notice that the value of a grouped expression is its last line... this is a bit confusing... Let's change the example slightly
sim.ols
Why do we care?
We first consider the function ifelse, the statement if and the function
switch
pizza.ingredients<-c("flour","water","oil",
"tomato","mozzarella","pineapple")
ifelse("pineapple" %in% pizza.ingredients,
"THE END IS NIGH!!!",
"You are on the right track")
[1] "ARRGHHH!"
Problem
Prevent Armageddon by replacing pineapple with basil. Also add
gorgonzola and sausage.
a.number<-sample(c(1,2))[1]
switch(a.number,"number one","number two")
require(stats)
centre <- function(x, type) {
switch(type,
mean = mean(x),
median = median(x),
trimmed = mean(x, trim = .1))
}
x <- rcauchy(1000)
centre(x, "mean");centre(x, "median");centre(x, "trimmed")
[1] -0.879
[1] -0.0389
[1] -0.0828
The for loop is one of the most common ways to run repeated operations.
If we ran a two-tailed 5% test and could repeat the experiment many times, the test would reject a true null hypothesis 5% of the time.
n<-100;reps<- 1000
b2.hat<-t.test<-matrix(NA,reps,1)
for(i in 1:reps){
# simulate the data
x<-rnorm(n);e<-rnorm(n);y<- 1+2*x+e;X<-cbind(1,x)
# estimate the parameters
b.hat<-solve(crossprod(X))%*%crossprod(X,y)
e.hat<- y-X%*%b.hat
var.b<-as.numeric(crossprod(e.hat)/(n-2))*
solve(crossprod(X))
b2.hat[i]<-b.hat[2]
t.test[i]<-(b2.hat[i]-2)/sqrt(var.b[2,2])
}
t.size<-mean(1*(abs(t.test)>qnorm(0.975)))
t.size # simulated type I error for a 5% nominal size test
[1] 0.053
This seems interesting. So, what does that actually say? And why do we
care?
Notice that our choices for the design of the data generating process
(DGP) match those of the classical linear regression model (why?).
Problem
Try to run a similar experiment but this time consider different distributions
for the error term, e.g., a t distribution with 5 degrees of freedom and a
Cauchy distribution. In addition, using the same setup study the properties
of the 95% confidence interval for the same parameter. What do you
observe?
Other important functions are while and repeat. The former executes an
expression until the requirements of a logical condition are met. For
example
i<-0
while(i<=10){i<-i+1; if(i==10){print("we are done now")}}
Notice that R is a highly vectorized language. This means that you can
often work out your problem by using vector operations instead of loops.
This generally makes operations more time efficient.
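A minimal sketch of the idea:
v <- rnorm(1e6)
s <- 0
for (i in seq_along(v)) s <- s + v[i]  # loop version
sum(v)                                 # vectorized version: same result, much faster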
While there are uncountable packages that perform the most diverse array
of operations, you may still find writing your own function useful.
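The code referred to below is not reproduced in these notes; presumably it is along these lines (the score and student-teacher ratio constructions are the standard ones and are assumptions here):
library(AER)
data("CASchools")
CASchools$STR   <- CASchools$students / CASchools$teachers  # student-teacher ratio
CASchools$score <- (CASchools$read + CASchools$math) / 2    # average test score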
The above code retrieves a data set on Californian schools from the
package AER
# what's in there?
# head(CASchools)
# str(CASchools)
names(CASchools)
my.ols<-function(y,x,const){
y<-as.matrix(y)
x<-as.matrix(x)
if(const==TRUE){
X<-cbind(1,x)
}else{X<-x}
b.hat<-solve(t(X)%*%X)%*%(t(X)%*%y)
b.hat
}
ols1<-my.ols(y=CASchools$score,x=CASchools$STR,const=TRUE)
ols1
[,1]
[1,] 698.93
[2,] -2.28
Problem
Integrate the output of my.ols with a t test for the null hypothesis that β1
and β2 are zero and the corresponding p-values.
There are some features that we may find useful when working with functions.
Notice that when we specify the arguments by name their order does not matter.
my.ols(y=CASchools$score,x=CASchools$STR,const=TRUE)
[,1]
[1,] 698.93
[2,] -2.28
my.ols(x=CASchools$STR,y=CASchools$score,const=TRUE)
[,1]
[1,] 698.93
[2,] -2.28
my.ols(CASchools$score,CASchools$STR,TRUE)
[,1]
[1,] 698.93
[2,] -2.28
my.ols(y=CASchools$score,x=CASchools$STR)
my.ols(CASchools$score,TRUE,CASchools$STR)
In many cases our function may call other functions for which we may need to specify their associated arguments. This can be done using ..., for example
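A hypothetical sketch of a function that passes extra arguments to plot() through ...:
my.plot <- function(x, y, ...){
  plot(x, y, ...)   # anything extra is handed on to plot()
}
my.plot(1:10, rnorm(10))                       # default graphical settings
my.plot(1:10, rnorm(10), type = "l", col = 2)  # settings passed via ...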
We do not need to specify the arguments for the graphical device unless we
want to.
Note that any assignment within the function is local to the function.
We are tackling this aspect only marginally to dedicate more time to the
study of the tidyverse packages.
One of the reasons of R’s success is its ability to produce beautiful graphs.
[Figure: scatter plot involving log(subs)]
You can for example change the points into lines by specifying type="l" in
the function plot. Further changes may be applied also via the function
par. Check
?plot
?par
[Figure: "Library subscriptions", log(subs) vs log(citeprice)]
[Figure: log(subs) vs log(citeprice), with Econometrica highlighted]
We can alter the plot by using further functions such as lines, points,
legend, abline.
More specific plots are also available: barplot, boxplot, qqplot, hist.
demo("graphics")
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
[Figure: plot of dnorm(x) for x between -4 and 4]
?optimize
?optim
You can also check the specific optimization task view in R (link) to check
for new developments on the relevant packages.
It depends.
$minimum
[1] 3.03
$objective
[1] -1.05
$maximum
[1] 4.06
$objective
[1] 1.1
Let us see whether we can apply this approach to a more familiar problem, the OLS estimator:
y_i = \alpha + \beta x_i + \varepsilon_i
We know that this problem has a closed-form solution but, for the sake of the example, we try to solve it numerically.
Q<-function(theta){
theta<-as.vector(theta)
y<-as.matrix(y)
x<-as.matrix(x)
X<-cbind(1,x)
e<-y-X%*%theta
Q<-crossprod(e)
}
attach(Journals)
y<-log(subs); x<-log(citeprice)
ols2<-optim(c(0,0),Q,method = "BFGS");ols2$par
lm(y~x)$coefficients
(Intercept) x
4.766 -0.533
gauss.ll<-function(theta,x){
mu<-theta[1]
sig2<-theta[2]
x<-as.matrix(x)
n<-nrow(x)
ll<- -.5*n*log(2*pi)-.5*n*log(sig2)-sum((x-mu)^2)/(2*sig2)
return(-ll)
}
set.seed(12345)
x<-rnorm(100,1,2)
gauss.ll.opt<-optim(c(1,4),gauss.ll,x=x)
gauss.ll.opt$par
gauss.ll2<-function(theta,x){
mu<-theta[1]
sig<-theta[2]
x<-as.matrix(x)
n<-nrow(x)
z<-(x-mu)/sig
ll<- -n*log(sig)+sum(dnorm(z,log=TRUE))
return(-ll)
}
set.seed(12345)
x<-rnorm(100,1,2)
gauss.ll.opt2<-optim(c(1,4),gauss.ll2,x=x)
gauss.ll.opt2$par
set.seed(12345)
x<-rnorm(100,1,2)
gauss.ll.opt2<-optim(c(1,4),gauss.ll2,x=x, hessian = TRUE)
gauss.ll.opt2$par
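The matrix FI used below is not defined on the slide; presumably it is the inverse of the numerical Hessian:
FI <- solve(gauss.ll.opt2$hessian)  # assumed definition
FI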
[,1] [,2]
[1,] 4.92e-02 -4.05e-06
[2,] -4.05e-06 2.46e-02
se<-sqrt(diag(FI));se
This example is from Kleiber and Zeileis, see also Zellner and Ryu (1998)
and Greene (2003, Chapter 17).
Notice that \frac{\partial \varepsilon_i}{\partial Y_i} = \frac{1+\theta Y_i}{Y_i}. The loglikelihood is then
\ell = \sum_{i=1}^{n}\left(\log(1+\theta Y_i) - \log(Y_i)\right) - \sum_{i=1}^{n}\log\phi\!\left(\frac{\varepsilon_i}{\sigma}\right).
sqrt(diag(solve(opt.prodf.ll$hessian)))[1:4]
-opt.prodf.ll$value
[1] -8.94
The goal of this part is to introduce some concepts of modern data science.
The main reference for this section is R for Data Science by Wickham and
Grolemund. This is a great book, clearly written and free.
#install.packages("tidyverse")
library(tidyverse)
Let us use the car dataset mpg. We want to ask the following question: do
cars with big engines use more fuel than cars with small engines?
#?mpg
mpg
# A tibble: 234 x 11
manufacturer model displ year cyl trans drv
<chr> <chr> <dbl> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto~ f
2 audi a4 1.8 1999 4 manu~ f
3 audi a4 2 2008 4 manu~ f
4 audi a4 2 2008 4 auto~ f
5 audi a4 2.8 1999 6 auto~ f
6 audi a4 2.8 1999 6 manu~ f
7 audi a4 3.1 2008 6 auto~ f
8 audi a4 quatt~ 1.8 1999 4 manu~ 4
9 audi a4 quatt~ 1.8 1999 4 auto~ 4
10 audi a4 quatt~ 2 2008 4 manu~ 4
# i 224 more rows
# i 4 more variables: cty <int>, hwy <int>, fl <chr>,
# class <chr>
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
[Figure: scatter plot of hwy vs displ]
ggplot(data = mpg)
We used a function geom_point to map points into the graph. This is but
one of the possible so called geom functions.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Let us look again at the plot: at the right hand side there are cars that are
surprisingly efficient.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
[Figure: scatter plot of hwy vs displ]
They may be hybrid cars. Let us look at the variable class and map it to
the aesthetic color.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
color = class))
[Figure: hwy vs displ, points coloured by class]
Well, they turn out to be sports cars and not hybrids. This is because they have a big engine but a small body, which improves their efficiency.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
size = class))
[Figure: hwy vs displ, point size mapped to class]
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
alpha = class))
[Figure: hwy vs displ, point transparency (alpha) mapped to class]
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
shape = class))
[Figure: hwy vs displ, point shape mapped to class]
Warning: The shape palette can deal with a maximum of 6 discrete values
because more than 6 becomes difficult to discriminate; you have 7.
Introducing the tidyverse
Visualization
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy),
color = "blue")
[Figure: hwy vs displ, all points drawn in blue]
Another way to add categorical variables into our plot is to use facets
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
[Figure: hwy vs displ, faceted by class]
# points, as before
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
[Figure: scatter plot of hwy vs displ]
[Figures: geom_smooth plots of hwy vs displ, one of them split by drv]
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
[Figure: hwy vs displ with points and a smooth fit]
This is nice, but notice that we basically duplicate code. There is a more efficient approach.
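A sketch of the idea: put the shared mapping in the ggplot() call so that both geoms inherit it.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()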
[Figure: hwy vs displ with points and a smooth fit]
[Figures: hwy vs displ with class mapped to colour, combined with a smooth fit]
dim(diamonds)
[1] 53940 10
names(diamonds)
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
[Figure: bar chart of diamond counts by cut]
Notice that the data set does not contain counts for our diamonds.
Some geoms plot the raw values; others apply some transformation to the data to get the end result.
The algorithm used to calculate new values for a graph is called a stat.
This is typical of histograms, pie charts and any other graph that requires a
transformation of the data.
To every geom is associated a stat, and the other way around; this makes the production of a specific plot (say a histogram) automatic.
There are though some situations where you may consider using a stat
explicitly.
1 You do not like the default result and you want to override the default
stat.
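The demo data frame used below is not defined in these notes; presumably it contains the counts per cut, e.g.
demo <- tribble(
  ~cut,        ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good", 12082,
  "Premium",   13791,
  "Ideal",     21551
)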
ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq),
stat = "identity")
[Figure: bar chart of freq by cut (stat = "identity")]
Notice that, in a way, the last plot is conceptually more similar to a scatter
plot than a bar plot.
Problem
Then the question for you is: what does identity do?
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
[Figure: bar chart of prop by cut, with every bar equal to 1]
What ggplot actually does is to take the levels of cut, which are Fair,
Good, Very Good, Premium and Ideal, and calculate proportions for each
bin.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop),
group = "dummy argument"))
[Figure: bar chart of prop by cut with group set]
Specifying a value for group makes the proportions be computed with respect to all observations taken together, rather than within each bar.
For example, you might use stat_summary, which summarizes the y values
for each unique x value, to draw attention to the summary that you are
computing.
p<-ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)
p
[Figure: stat_summary of depth by cut (min, max, median)]
p<-ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = function(x) mean(x) - sd(x),
fun.max = function(x) mean(x) + sd(x),
fun = mean
)
p
[Figure: stat_summary of depth by cut (mean plus/minus one sd)]
p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
[Figure: bar chart of cut with colour = cut]
Here we have used the colour aesthetic. Let us check what happens if we use fill.
p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
[Figure: bar chart of cut with fill = cut]
Notice that the x variable and the fill aesthetic are assigned the same
variable, cut. What happens if we change one of the two?
p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
[Figure: stacked bar chart of cut filled by clarity]
[Figures: bar charts of cut by clarity with alternative position adjustments]
position = "fill" works like stacking, but makes each set of stacked
bars the same height. This makes it easier to compare proportions across
groups.
p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity),
position = "fill")
[Figure: bar chart of cut by clarity with position = "fill"]
p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity),
position = "dodge")
[Figure: bar chart of cut by clarity with position = "dodge"]
In some data sets we may have values that coincide. This causes some
points in, say, a scatterplot to overlap.
This property of the data may provide wrong insights. One thing that we
can do is to nudge our points with some random noise.
p<-ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy),
position = "jitter")
[Figure: hwy vs displ with position = "jitter"]
ggplot has a couple of functions that may be used for this specific purpose.
[Figure: boxplots of hwy by class with flipped coordinates]
library(maps)
nz <- map_data("nz")
p<-ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
[Figures: maps of New Zealand (longitude/latitude polygons)]
Problem
coord_polar uses polar coordinates.
See what happens when you apply polar coordinates to our bar plot.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
We will consider a data set containing all 336,776 flights that departed from New York City in 2013. We will do our manipulations with the package dplyr (already in the tidyverse).
library(nycflights13)
flights
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 2
2 2013 1 1 533 529 4
3 2013 1 1 542 540 2
4 2013 1 1 544 545 -1
5 2013 1 1 554 600 -6
6 2013 1 1 554 558 -4
7 2013 1 1 555 600 -5
8 2013 1 1 557 600 -3
9 2013 1 1 557 600 -3
10 2013 1 1 558 600 -2
# i 336,766 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
If you have worked with R already you will notice some differences with
other data frame structures.
This data set provides us with some information about the variables, the
size of the data set and the type of variables we are dealing with.
In particular
1 int stands for integers.
2 dbl stands for doubles.
3 chr stands for character vectors or strings.
4 dttm stands for date-times (a date + a time).
5 lgl stands for logical, vectors that contain only TRUE or FALSE.
6 fctr stands for factors, which R uses to represent categorical
variables with fixed possible values.
7 date stands for dates.
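The output below is presumably produced by a filter call along these lines (a sketch; the slide's code is not shown):
filter(flights, month == 12, day == 25)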
# A tibble: 719 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 12 25 456 500 -4
2 2013 12 25 524 515 9
3 2013 12 25 542 540 2
4 2013 12 25 546 550 -4
5 2013 12 25 556 600 -4
6 2013 12 25 557 600 -3
7 2013 12 25 557 600 -3
8 2013 12 25 559 600 -1
9 2013 12 25 559 600 -1
10 2013 12 25 600 600 0
# i 709 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Notice that comparisons are carried out using the usual symbols: >, >=, <, <=, !=, ==
sqrt(2) ^ 2 == 2
[1] FALSE
near(sqrt(2) ^ 2, 2)
[1] TRUE
# A tibble: 55,403 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 11 1 5 2359 6
2 2013 11 1 35 2250 105
3 2013 11 1 455 500 -5
4 2013 11 1 539 545 -6
5 2013 11 1 542 545 -3
6 2013 11 1 549 600 -11
7 2013 11 1 550 600 -10
8 2013 11 1 554 600 -6
9 2013 11 1 554 600 -6
10 2013 11 1 554 600 -6
# i 55,393 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# A tibble: 27,004 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 2
2 2013 1 1 533 529 4
3 2013 1 1 542 540 2
4 2013 1 1 544 545 -1
5 2013 1 1 554 600 -6
6 2013 1 1 554 558 -4
7 2013 1 1 555 600 -5
8 2013 1 1 557 600 -3
9 2013 1 1 557 600 -3
10 2013 1 1 558 600 -2
# i 26,994 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# A tibble: 55,403 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 11 1 5 2359 6
2 2013 11 1 35 2250 105
3 2013 11 1 455 500 -5
4 2013 11 1 539 545 -6
5 2013 11 1 542 545 -3
6 2013 11 1 549 600 -11
7 2013 11 1 550 600 -10
8 2013 11 1 554 600 -6
9 2013 11 1 554 600 -6
10 2013 11 1 554 600 -6
# i 55,393 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
“Select the flights that do not have an arrival delay (A) greater than 120
nor a departure delay (B) greater than 120.”
In logical terms this is
¬(A ∪ B) = ¬A ∩ ¬B
[1] TRUE
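A sketch of the corresponding dplyr calls (assumed; the slide's code is not shown):
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)  # equivalent, by De Morgan's law
arrange(flights, desc(dep_delay))                    # presumably what produces the sorted output below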
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 9 641 900 1301
2 2013 6 15 1432 1935 1137
3 2013 1 10 1121 1635 1126
4 2013 9 20 1139 1845 1014
5 2013 7 22 845 1600 1005
6 2013 4 10 1100 1900 960
7 2013 3 17 2321 810 911
8 2013 6 27 959 1900 899
9 2013 7 22 2257 759 898
10 2013 12 5 756 1700 896
# i 336,766 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
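The next outputs are presumably produced by select calls along these lines (a sketch):
select(flights, year, month, day)
select(flights, year:day)       # same columns, selected as a range
select(flights, -(year:day))    # everything except those columns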
# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows
# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows
# A tibble: 336,776 x 16
dep_time sched_dep_time dep_delay arr_time
<int> <int> <dbl> <int>
1 517 515 2 830
2 533 529 4 850
3 542 540 2 923
4 544 545 -1 1004
5 554 600 -6 812
6 554 558 -4 740
7 555 600 -5 913
8 557 600 -3 709
9 557 600 -3 838
10 558 600 -2 753
# i 336,766 more rows
# i 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
When dealing with large data sets, we may not know exactly the name of
the variables of interest, or we may just know a part of them.
In this case we can use some helper functions
1 starts_with("abc"): matches names that begin with “abc”
2 ends_with("xyz"): matches names that end with “xyz”
3 contains("ijk"): matches names that contain “ijk”
4 num_range("x",1:3): matches x1, x2, x3
Other functions can be found by checking select.
# A tibble: 336,776 x 8
dep_time sched_dep_time arr_time sched_arr_time
<int> <int> <int> <int>
1 517 515 830 819
2 533 529 850 830
3 542 540 923 850
4 544 545 1004 1022
5 554 600 812 837
6 554 558 740 728
7 555 600 913 854
8 557 600 709 723
9 557 600 838 846
10 558 600 753 745
# i 336,766 more rows
# i 4 more variables: air_time <dbl>, day <int>,
# dep_delay <dbl>, arr_delay <dbl>
# A tibble: 336,776 x 0
The verb mutate is arguably one of the most useful, as it allows us to operate on the existing variables to create new ones. The new variables are automatically included in the data set.
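The flights_sml data set used below is not defined in these notes; presumably it is the usual
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)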
# A tibble: 336,776 x 7
year month day dep_delay arr_delay distance
<int> <int> <int> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400
2 2013 1 1 4 20 1416
3 2013 1 1 2 33 1089
4 2013 1 1 -1 -18 1576
5 2013 1 1 -6 -25 762
6 2013 1 1 -4 12 719
7 2013 1 1 -5 19 1065
8 2013 1 1 -3 -14 229
9 2013 1 1 -3 -8 944
10 2013 1 1 -2 8 733
# i 336,766 more rows
# i 1 more variable: air_time <dbl>
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
# A tibble: 336,776 x 9
year month day dep_delay arr_delay distance
<int> <int> <int> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400
2 2013 1 1 4 20 1416
3 2013 1 1 2 33 1089
4 2013 1 1 -1 -18 1576
5 2013 1 1 -6 -25 762
6 2013 1 1 -4 12 719
7 2013 1 1 -5 19 1065
8 2013 1 1 -3 -14 229
9 2013 1 1 -3 -8 944
10 2013 1 1 -2 8 733
# i 336,766 more rows
# i 3 more variables: air_time <dbl>, gain <dbl>,
# speed <dbl>
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
# A tibble: 336,776 x 3
gain hours gain_per_hour
<dbl> <dbl> <dbl>
1 -9 3.78 -2.38
2 -16 3.78 -4.23
3 -31 2.67 -11.6
4 17 3.05 5.57
5 19 1.93 9.83
6 -16 2.5 -6.4
7 -24 2.63 -9.11
8 11 0.883 12.5
9 5 2.33 2.14
10 -10 2.3 -4.35
# i 336,766 more rows
The operator %/% stands for integer division and %% returns the remainder.
x==y*(x%/%y)+(x%%y)
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)
# A tibble: 336,776 x 3
dep_time hour minute
<int> <dbl> <dbl>
1 517 5 17
2 533 5 33
3 542 5 42
4 544 5 44
5 554 5 54
6 554 5 54
7 555 5 55
8 557 5 57
9 557 5 57
10 558 5 58
# i 336,766 more rows
Once the data is grouped we can invoke summarise to collapse each group
into a single row summary. This generally happens via some summarizing
function.
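A sketch of the calls that presumably produce the output below:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
by_carrier <- group_by(flights, carrier)
summarise(by_carrier, delay = mean(dep_delay, na.rm = TRUE))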
# A tibble: 1 x 1
delay
<dbl>
1 12.6
# A tibble: 16 x 2
carrier delay
<chr> <dbl>
1 9E 16.7
2 AA 8.59
3 AS 5.80
4 B6 13.0
5 DL 9.26
6 EV 20.0
7 F9 20.2
8 FL 18.7
9 HA 4.90
10 MQ 10.6
11 OO 12.6
12 UA 12.1
13 US 3.78
14 VX 12.9
15 WN 17.7
16 YV 19.0
[Figure: average delay vs distance by destination, point size proportional to count]
We can do the same thing in a faster way using the pipe %>%.
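A sketch of the piped version (assumed; grouping by destination as in the figure above):
delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist  = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")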
[Figure: the same plot as above, produced with the pipe]
What does the pipe do exactly? Basically, it passes an object (e.g. a data
frame) to the first argument of a function.
What we have is that x %>% f(y) turns into f(x, y), x %>% f(y) %>% g(z) turns into g(f(x, y), z), and so on.
The idea that lurks behind this process is to take qualitative questions and
turn them into quantitative answers.
If you want you can see this as a deeply creative process and maybe even a
form of art.
Despite the countless possible ways to start EDA, there exist some fixed points. You should ask
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
[Figure: bar chart of diamond counts by cut]
# A tibble: 5 x 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
[Figure: histogram of carat]
diamonds %>%
count(cut_width(carat, 0.5))
# A tibble: 11 x 2
`cut_width(carat, 0.5)` n
<fct> <int>
1 [-0.25,0.25] 785
2 (0.25,0.75] 29498
3 (0.75,1.25] 15977
4 (1.25,1.75] 5313
5 (1.75,2.25] 2002
6 (2.25,2.75] 322
7 (2.75,3.25] 32
8 (3.25,3.75] 5
9 (3.75,4.25] 4
10 (4.25,4.75] 1
Exploratory data analysis
The width of the bin is called the binwidth. Different binwidths may reveal
different patterns.
[Figure: histogram of carat with a different binwidth]
This knowledge may be used to ask more questions. Which are the most
common (rare) values and why? Is this what we expected? Can we see
some special pattern and why?
[Figure: histogram of carat showing clusters]
Problem
The above picture suggests the existence of clusters. Investigate why such
clusters appear.
Outliers may be just data entry errors or may reveal some deeper properties
of the data generating process.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
[Figure: histogram of y with binwidth 0.5]
When we have loads of data, outliers may be difficult to see, even though the histogram provides a clue (the x axis stretching far beyond the bulk of the data). Let us zoom in.
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
[Figure: the same histogram zoomed in with coord_cartesian(ylim = c(0, 50))]
We see that there are diamonds of size zero (?). They are probably
mistakes. We also see that there are a couple of big ones.
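The table below is presumably obtained with something along these lines (a sketch):
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
unusual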
# A tibble: 9 x 4
price x y z
<int> <dbl> <dbl> <dbl>
1 5139 0 0 0
2 6381 0 0 0
3 12800 0 0 0
4 15686 0 0 0
5 18034 0 0 0
6 2130 0 0 0
7 2130 0 0 0
8 2075 5.15 31.8 5.12
9 12210 8.09 58.9 8.06
By looking at the price we may suspect that this is also a data entry
mistake.
To fix this you may either drop the rows where the erroneous data are or, maybe more appropriately, replace them with NA.
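A sketch of the NA replacement (assumed thresholds):
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point()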
[Figure: scatter plot of y vs x]
In our flight data, for example, missing values in departure time may
indicate that the flight was canceled.
p<-nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time,)) +
geom_freqpoly(mapping = aes(colour = cancelled),
binwidth = 1/4)
[Figure: frequency polygons of sched_dep_time by cancelled (counts)]
Unfortunately, the result is not that great due to the fact that there are
more non canceled flights than canceled flights.
p<-nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time,after_stat(density))) +
geom_freqpoly(mapping = aes(colour = cancelled),
binwidth = 1/4)
[Figure: frequency polygons of sched_dep_time by cancelled (density)]
Problem
Comment on the above picture.
[Figure: frequency polygons of price by cut (counts)]
Given the heterogeneity in counts, this may not be the best solution.
[Figure: frequency polygons of price by cut (density)]
We see a bump in the lowest quality diamonds (fair) that suggests that
their average price may be the highest.
Boxplots feature median, the interquartile range (the box), two lines that
denote the range of the non outlier observations (the whiskers) and a
bunch of outside points that stretch beyond the whiskers (outliers).
[Figure: boxplots of price by cut]
Problem
The above plot seems to confirm our conjecture on the mean or median
prices of low quality diamonds. As an exercise dig deeper and try to
understand what is going on.
[Figure: boxplots of hwy by class]
We may reorder the plot according to a given quantity, say, the median.
ggplot(data = mpg) +
geom_boxplot(mapping =
aes(x = reorder(class, hwy, FUN = median),
y = hwy))
[Figure: boxplots of hwy by class, reordered by the median]
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
[Figure: counts of the cut/color combinations (geom_count)]
The size of the point is proportional to the count for the given combination.
diamonds %>%
count(color, cut)
# A tibble: 35 x 3
color cut n
<ord> <ord> <int>
1 D Fair 163
2 D Good 662
3 D Very Good 1513
4 D Premium 1603
5 D Ideal 2834
6 E Fair 224
7 E Good 933
8 E Very Good 2400
9 E Premium 2337
10 E Ideal 3903
# i 25 more rows
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
[Figure: tile plot of counts by color and cut]
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
[Figure: scatter plot of price vs carat]
When our data set gets large, some points in the graph may be overplotted by other points. A workaround for this undesirable feature is to use transparency.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price)
, alpha = 1 / 100)
[Figure: price vs carat with alpha = 1/100]
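The smaller data set used below is not defined in these notes; presumably it is
smaller <- diamonds %>% filter(carat < 3)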
library(hexbin)
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
[Figure: hexagonal bins of price vs carat]
We will conclude here our extremely brief journey into data analysis. We
only covered a small fraction of the potential topics. For such a reason I
urge you to explore the book by Wickham & Grolemund beyond the topics
covered.
We will consider now various applications. We will not spend too much
time on the theoretical aspects of the methods and we will try to
understand how to make things work in practice.
The examples come from the various books mentioned in your reading list.
Suppose we are working with a certain estimation method (OLS, IV, GMM,
or something else).
Hence, our empirical analysis would rely on the fact that our asymptotic
distributions are sufficiently accurate to describe the finite sample
distributions that are, by the way, unknown.
That is, we sample a large number of data sets from a given distribution.
The probit model is used (along with other binary choice models) when the
outcome variable takes only two values.
Consider
y_i^* = x_i'\theta + \varepsilon_i
where the latent variable y_i^* is not observed; what we observe is y_i = 1(y_i^* > 0).
Moreover,
E[1(y_i^* > 0) \mid x_i] = E[1(x_i'\theta + \varepsilon_i > 0) \mid x_i] = P(\varepsilon_i > -x_i'\theta) = \Phi(x_i'\theta).
It is well known that the loglikelihood function for the probit model is
\ell = \sum_{i=1}^{n}\left[ y_i \log\Phi(x_i'\theta) + (1-y_i)\log\bigl(1-\Phi(x_i'\theta)\bigr) \right]
where
\Omega = E\left[\frac{\bigl(\phi(x_i'\theta)\bigr)^2}{\Phi(x_i'\theta)\bigl(1-\Phi(x_i'\theta)\bigr)}\, x_i x_i'\right]
l.probit<-function(theta,y,x){
Phi<-pnorm(x%*%theta)
logl<- sum(y*log(Phi))+sum((1-y)*log(1-Phi))
return(-logl)
}
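The setup for the simulation is not shown in these notes; a sketch (the values of R, n and theta0 are assumptions, except that theta0[2] is later tested against 0.1):
R <- 1000; n <- 100
theta0 <- c(0.5, 0.1, 0.5)                                  # hypothetical parameter values
theta.hat <- se <- t.test <- matrix(NA, R, length(theta0))  # containers for the results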
for (i in 1:R) {
x<-cbind(1,rnorm(n,1,1),rnorm(n,1,1))
eps<- rnorm(n)
y.star<- x%*%theta0+eps
y<-1*(y.star>0)
res<-optim(theta0,l.probit,x=x,y=y,
hessian=TRUE, method = "L-BFGS-B")
theta.hat[i,]<-res$par
Omega.hat<-solve(res$hessian)
se[i,]<-sqrt(diag(Omega.hat))
t.test[i,]<- (theta.hat[i,]-theta0)/se[i,]
}
#glm1<-glm(y ~ -1+x, family = binomial(link="probit"))
[Figure: histogram of the simulated t statistics]
#hist(t[,2], freq=TRUE)
The researcher chooses α, and we want the actual rejection rate of our test not to exceed that value, so to speak.
We generally do not know whether our test does overreject or not. We use
simulation to build insight.
For example, what is the size of the t test we calculated via Monte Carlo
simulation in the previous slide?
Notice
We know that H0 is true because we built it that way, so we can use the
analogy principle
\frac{1}{R}\sum_{r=1}^{R} 1(|t_r| > q).
mean(1*(abs(t.test[,2])>1.96))
[1] 0.053
A related quantity is
t.test1<-(theta.hat[,2]-0.100001)/se[,2]
mean(1*(abs(t.test1)>1.96))
[1] 0.053
t.test1<-(theta.hat[,2]-0.2)/se[,2]
mean(1*(abs(t.test1)>1.96))
[1] 0.115
Problem
Try to draw a picture where on the x axis you have a set of alternatives and
on the y axis you place the rejection rates.
We can use the simulation machinery we built in our next endeavour, which
is the bootstrap.
We do not see the population, but we have a sample from which we can calculate an estimator for \mu, say \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i.
In the bootstrap world the population is the sample and the sample is the
resample.
We do not know the left hand side but we do know the right hand side.
set.seed(1234)
x<-rt(100,5);y<-rchisq(100,2);B<-499
x.bar.star<-y.bar.star<-c()
for(i in 1:B){
x.bar.star[i]<-mean(sample(x,replace=TRUE))
y.bar.star[i]<-mean(sample(y,replace=TRUE))
}
mean(x)-0;mean(x.bar.star)-mean(x)
[1] 0.0294
[1] -0.000656
mean(y)-2;mean(y.bar.star)-mean(y)
[1] 0.476
[1] -0.0145
data("Journals")
journals <- Journals[, c("subs", "price")]
journals$citeprice <- Journals$price/Journals$citations
y<-log(journals$subs)
X<-cbind(1,log(journals$citeprice))
(beta.hat<-solve(crossprod(X))%*%crossprod(X,y))
[,1]
[1,] 4.766
[2,] -0.533
e<-y-X%*%beta.hat
V.hat<-solve(crossprod(X)/length(y))*
as.numeric(crossprod(e)/length(y))
t<-sqrt(length(y))*(beta.hat[2]-0)/sqrt(V.hat[2,2])
t^* = \sqrt{n}\,\frac{\beta_2^* - \hat{\beta}_2}{\hat{\sigma}_2^*}
set.seed(2356)
B<-99
t.star<-c()
n<-length(y)
for (i in 1:B) {
rsmpl<-sample(1:n,replace = TRUE)
y.star<-y[rsmpl];X.star<-X[rsmpl,]
beta.star<-solve(crossprod(X.star))%*%
crossprod(X.star,y.star)
e.star<-y.star-X.star%*%beta.star
V.star<-as.numeric(crossprod(e.star)/n)*
solve(crossprod(X.star)/n)
t.star[i]<-sqrt(n)*(beta.star[2]-beta.hat[2])/
sqrt(V.star[2,2])
}
hist(t.star)
[Figure: histogram of t.star]
if(abs(t)>=quantile(t.star,0.975)){
decision<-"reject the null"
}else{
decision<-"cannot reject the null"
}
decision
Problem
Here are some simple and not so simple questions for you
1 Can you work out a 95% confidence interval for the above example?
2 Provide a similar result to the above using the percentile method.
3 The methods we use do not necessarily apply in the presence of serial
dependence or heteroskedasticity. Provide simulation evidence in the
context of the linear regression model.
It is important to bear in mind that looking at the data alone does not help
us to uncover causal relationships.
For that we need some knowledge of the process we are studying. We hope
that such knowledge can be represented with a DAG.
n<-1000
x1<-rnorm(n);y1<-x1+1+sqrt(3)*rnorm(n)
y2<-1+2*rnorm(n);x2<-(y2-1)/4+sqrt(3)*rnorm(n)/2
z3<-rnorm(n);y3<-z3+1+sqrt(3)*rnorm(n);x3<-z3
cor(x1,y1);cor(x2,y2);cor(x3,y3)
[1] 0.539
[1] 0.494
[1] 0.493
library("ggExtra")
p <- ggplot(as.data.frame(cbind(x1,y1)), aes(x=x1, y=y1))+
geom_point(col="red")
ggMarginal(p, type = "density")
[Figure: y1 vs x1 with marginal densities]
library("ggExtra")
p <- ggplot(as.data.frame(cbind(x2,y2)), aes(x=x2, y=y2))+
geom_point(col="blue")
ggMarginal(p, type = "density")
[Figure: y2 vs x2 with marginal densities]
library("ggExtra")
p <- ggplot(as.data.frame(cbind(x3,y3)), aes(x=x3, y=y3))+
geom_point(col="green")
ggMarginal(p, type = "density")
[Figure: y3 vs x3 with marginal densities]
The data are produced in different ways yet from the scatter plots their
joint distributions are indistinguishable.
Problem
Look at the DGPs for x, y and z and study their corresponding DAG
representation. You may use the library dagitty.
We start with the do-calculus and we study two relevant cases: IV and the
front-door criterion (you may remember the back-door criterion we saw in
Econometrics I).
The do-calculus is a set of (three) rules that can be applied to a DAG and
that help us identify causal effects.
When dealing with the do-calculus you will see the so called do-operator
denoted as do(X = x) or do(x).
Let us introduce some useful notation and see some details using equations
and DAGs.
library(dagitty)
library(ggdag)
library(patchwork) # For combining plots
p.rule1_g<-ggdag(rule1_g)+theme_dag()
p.rule1_g_x_over<-ggdag(rule1_g_x_over)+theme_dag()
p.rule1_g | p.rule1_g_x_over
[Figure: the DAG on nodes Z, W, X, Y and its modified version for rule 1]
You may recall that our question is whether in computing the causal effect
X → Y we need to include somehow Z or can we just plain ignore it.
We can apply rule 1 to GX and notice that once we condition upon W and
X , Y and Z are d-separated. Thus
where GX ,Z is the graph where we remove the arrows coming out of Z and
the arrows coming into X .
p.rule2_g<-ggdag(rule2_g)+theme_dag()
p.rule2_g_modif<-ggdag(rule2_g_modif)+theme_dag()
p.rule2_g | p.rule2_g_modif
[Figure: the DAG on nodes Z, W, X, Y and its modified version for rule 2]
Do not get confused by the fact that we have two interventions here.
Typically, in applications you see only one.
Rule 3 is the most complicated one and it tells us when we can remove a
do-modified variable. Formally,
where Z (W ) reads “any node Z that is not an ancestor of W ” (we will see
how it plays out in the graph). That is, we can remove do(z) if there is no association, i.e. no unblocked paths, from Z to Y.
p.rule3_g<-ggdag(rule3_g)+theme_dag()
p.rule3_g_modif<-ggdag(rule3_g_modif)+theme_dag()
p.rule3_g | p.rule3_g_modif
[Figure: the DAG on nodes Z, W, X, Y and its modified version for rule 3]
Our operations have d-separated Z from all the other variables. Hence, we
can remove do(z)
Using the rules of do-calculus we can derive the formulae for the backdoor
and frontdoor adjustment.
Since you have an idea of what the backdoor criterion is, we explore the
frontdoor criterion with an example.
The above description of the rules of the do-calculus is not too intuitive.
p<-ggdag(dag)+theme_dag();p
Y = gY (M, U) = a1 + a2 M + a3 U + VY
X = gX (U) = b1 + b2 U + VX
M = gM (X ) = c1 + c2 X + VM
This effect is not identified due to the confounding effect of U and since it
is unobserved we cannot block the backdoor path.
Y = a1 + a2 M + a3 U + VY = a1 + a2 (c1 + c2 X + VM ) + a3 U + VY
= (a1 + a2 c1) + a2 c2 X + a3 U + (a2 VM + VY ).
However, we can estimate a2 and c2 separately and then take the product.
The regression
Y = β0 + β1 X + ε
does not recover the causal effect. Using the information described in the corresponding DAG, show how to consistently estimate the causal effect.
To see a numerical example for the frontdoor criterion look at this page
from Felix Thoemmes’ website and an example geared more towards
economics here from Alex Chinco’s website. The discussion on the
do-calculus rule heavily draws from this post by Andrew Heiss (you may
also like this chapter) and these posts from Ferenc Huszár’s website.
You may notice that much of this discussion relies little on the literature by
scholars in the field of economics. This is because economists have started
working with graphical models only recently.
Y = a + bX + U
library(ggdag)
dag2 <- tidy_dagitty(dag)
p<-ggdag(dag2) +
theme_dag()
dag <- dagitty("dag {
  X -> Y
  U -> Y
  U -> X
}")
dag2 <- tidy_dagitty(dag)
p<-ggdag(dag2) + theme_dag()
As an exercise devise a Monte Carlo experiment and show that the OLS
estimator for the causal effect X → Y is biased.
Suppose now that there exists another variable, say Z , such that Z → X
and causes Y only through X . To be specific. . .
dag <- dagitty("dag {
  X -> Y
  U -> Y
  U -> X
  Z -> X
}")
dag2 <- tidy_dagitty(dag)
p<-ggdag(dag2) + theme_dag()
Y = g(X, U) = a + bX + cU
X = h(Z, U) = d + eZ + fU
A simple linear model cannot capture the causal effect due to confounding
(common causes of income and education).
Problem
Draw a DAG that features the relation between education, income,
geographic variation and confounding variables. Then, using Card’s data
(SchoolingReturns) estimate the effect of education on income using IV.
Card's idea, for example, was to use geographic data (distance to college) as an instrument to consistently estimate the causal effect of education on income.
E [Y |Z = 1, T ] = E [Y |Z = 1, X = 1, T ]P(X = 1|Z = 1, T )
+ E [Y |Z = 1, X = 0, T ]P(X = 0|Z = 1, T ).
E[Y \mid X=1, C] - E[Y \mid X=0, C] = \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{P[X=1 \mid Z=1] - P[X=1 \mid Z=0]} = \frac{\mu_{y_1} - \mu_{y_0}}{p_{11} - p_{10}}
Problem
Implement the above code using Card’s data. Comment on the results.
library(ivreg)
data("SchoolingReturns")
y <- SchoolingReturns$wage
X <- 1*(SchoolingReturns$education>12)
Z <- 1*(SchoolingReturns$nearcollege=="yes")
[1] 731
[,1]
208
X 731