
Programming Skills for Economics

Federico Crudu
University of Siena

AA 2023-24

AA 2023-24 1 / 453
What is this course about?

1 Introduction to programming in R
2 Introduction to the tidyverse
3 Applications from economics and econometrics: optimization, Monte
Carlo simulation and the bootstrap, causality, estimation of structural
models, networks, growth models.

AA 2023-24 2 / 453
Introducing R

We will introduce R and its basic features, and we will first see how to apply it to simple data problems.

The reference book for this part is An Introduction to R by W. N. Venables, D. M. Smith and the R Core Team.

If you want to learn R in a quick and dirty fashion, go to Appendix A of the above book.

AA 2023-24 3 / 453
Introducing R

R is many things:
1 an effective data handling and storage facility
2 a suite of operators for calculations on arrays (vectors, matrices)
3 an integrated collection of tools for data analysis
4 graphical facilities for data analysis
5 a programming language

AA 2023-24 4 / 453
Introducing R

We will work with R together with RStudio.

RStudio is an integrated development environment (IDE) for R.

It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history and workspace management.

AA 2023-24 5 / 453
Introducing R

Notice that there are many (not necessarily overlapping) alternatives to R.

Besides programming languages such as C and FORTRAN, you have proprietary packages like GAUSS and Matlab, free software like Python and the more recent Julia.

RStudio allows you to run Python code and also Julia, although not as smoothly. Alternatives are Jupyter, Atom, VS Code.

You can also consider Google Colaboratory.

AA 2023-24 6 / 453
Introducing R
Help!!!

One of the most important and most frequently used functions is the help function. Suppose we want to know what the function solve does

help(solve)

Alternatively

?solve

If we are after a special feature or some reserved words we need quotes

help("if")

You may want to check what ?? does. E.g.,

??matrix
AA 2023-24 7 / 453
Introducing R
Help!!!

Probably the best way to get help is to use Google and/or AI chats.

AA 2023-24 8 / 453
Introducing R
Commands

The R language is case sensitive: this means that vector and Vector are two different things.

Elementary commands consist of expressions or assignments. An expression, when given, is evaluated, printed and then lost. An assignment passes the value to a variable but does not print it. E.g.

1+1 # expression

[1] 2

a<-1+1 # assignment
a

[1] 2

AA 2023-24 9 / 453
Introducing R
Commands

Notice from the previous slide that if we want to comment a line of code we can use the symbol #

# this is nothing

With some exceptions, you can place comments anywhere.

Furthermore, two commands can be separated via a new line or ";"

1+1; a<-1+1

[1] 2

AA 2023-24 10 / 453
Introducing R
Source files and diverting output

Sometimes we may need to write our commands in an external file. To call the commands from such a file we may use the function source

source("ABunchOfCommands.R")

If you want to divert output to a file, you can use the following

# sinks something into a text file
sink("sinksomething.txt")
for (i in 1:20)
print(i)
sink()
# this last sink() restores output to the console

AA 2023-24 11 / 453
Introducing R
Data permanency and removing objects

Objects are the entities that R manipulates and they may be of various
nature (variables, vectors, character strings, functions,...). To see which
objects are present in an R session we type

objects()

The collection of objects currently stored is called the workspace.

Objects can be removed using rm

rm(x, y, z, ink, junk, temp, foo, bar)

Check the command ls() and see what happens when you use it.

AA 2023-24 12 / 453
Introducing R
Vectors and assignment

We perform operations in R on given data structures. The simplest of them is the vector, an object consisting of an ordered collection of numbers

x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

Here we have used an assignment, the symbol <-, and the function c(). Notice that sometimes the symbol = is used; I would discourage that.

You can also use the more involved expression

assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))

AA 2023-24 13 / 453
Introducing R
Vectors and assignment

Notice also that you can concatenate vectors and numbers

y <- c(x, 0, x)
y

[1] 10.4 5.6 3.1 6.4 21.7 0.0 10.4 5.6 3.1 6.4
[11] 21.7

Can you use the assignment symbol in reverse order, i.e. ->?
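
(A possible answer, as a quick sketch: yes, the arrow also works left to right.)

1+1 -> b # assigns the value of the expression to b
b

[1] 2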

AA 2023-24 14 / 453
Introducing R
Vector arithmetic

Vectors can be used in arithmetic expressions; operations are performed element by element. Vectors need not all be of the same length! A recycling rule applies

v <- 2*x + y + 1
v; x; y

[1] 32.2 17.8 10.3 20.2 66.1 21.8 22.6 12.8 16.9 50.8
[11] 43.5
[1] 10.4 5.6 3.1 6.4 21.7
[1] 10.4 5.6 3.1 6.4 21.7 0.0 10.4 5.6 3.1 6.4
[11] 21.7

# you should get a warning here

Be careful when using non-conformable vectors (or matrices)!


AA 2023-24 15 / 453
Introducing R
Vector arithmetic

The elementary arithmetic operators are the usual +, -, *, / and ^. log, exp, sin, cos, tan, sqrt are available as well.

Other operators exist that may be useful in a number of circumstances, e.g. range, which returns c(min(x),max(x)), length, sum, prod

range(x); length(x); sum(x); prod(x)

[1] 3.1 21.7
[1] 5
[1] 47.2
[1] 25074

AA 2023-24 16 / 453
Introducing R
Vector arithmetic

We can perform simple operations that are useful for statistical work

mean(x); var(x)

[1] 9.44
[1] 53.9

which are equivalent to

sum(x)/length(x); sum((x-mean(x))^2)/(length(x)-1)

[1] 9.44
[1] 53.9

AA 2023-24 17 / 453
Introducing R
Vector arithmetic

Problem
Sometimes you need to sort your data in a given order. You can do this in
a number of ways. Try to see ?sort and ?order.

What happens if we take the square root of a negative number? What if we want to work with complex numbers?
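
A sketch of what you should find: on the reals the square root of a negative number is undefined, but R is happy to work on the complex plane if you ask explicitly.

sqrt(-1) # NaN, with a warning
sqrt(-1+0i) # a complex input gives a complex answer, 0+1i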

AA 2023-24 18 / 453
Introducing R
Generating regular sequences

There exist a number of useful and flexible functions that allow us to generate sequences of numbers. The brute force way is to use c(). This is clearly infeasible when the sequence is very large. Hence
1:30

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[18] 18 19 20 21 22 23 24 25 26 27 28 29 30

Be careful when using the operator : as it has higher priority within an expression
2*1:15; (2*1):15; 2*(1:15); 2*1:(15)

[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
[1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30

AA 2023-24 19 / 453
Introducing R
Generating regular sequences

Problem
Define n <- 10. Try to see what happens if you write 1:n-1 and
1:(n-1). Think also of a way of building a backward sequence.
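
A sketch of what to expect: the colon binds more tightly than the minus, so the two expressions differ.

n <- 10
1:n-1 # (1:n)-1, i.e. 0 1 ... 9
1:(n-1) # 1 2 ... 9
n:1 # one way to build a backward sequence
rev(1:n) # another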

AA 2023-24 20 / 453
Introducing R
Generating regular sequences

The function seq is more general than the colon operator. It takes five
arguments (whose order is irrelevant if you name them).

1:10; seq(1,10); #seq(from=1,to=10); seq(to=10, from=1)

[1] 1 2 3 4 5 6 7 8 9 10
[1] 1 2 3 4 5 6 7 8 9 10

seq(10,1); seq(1,10,by=.5)

[1] 10 9 8 7 6 5 4 3 2 1
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
[11] 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0

AA 2023-24 21 / 453
Introducing R
Generating regular sequences

You may check other uses of seq and the function rep

?seq; ?rep

AA 2023-24 22 / 453
Introducing R
Logical vectors

A logical vector takes the values TRUE, FALSE and NA (for "not available").

temp <- x>13
temp

[1] FALSE FALSE FALSE FALSE TRUE

The values in temp tell us if the logical condition is met. Notice that the condition is applied elementwise.

AA 2023-24 23 / 453
Introducing R
Logical vectors

The logical (comparison) operators are <, <=, >, >=, == and !=, whose meanings should be obvious.

If you have two logical expressions, say c1 and c2, you can operate on them using the Boolean operators: & stands for and, | stands for or.

They are particularly useful in subsetting data sets.

AA 2023-24 24 / 453
Introducing R
Logical vectors

Problem (Logical and dummies)
Consider the logical vector temp. Can you turn it into a vector of ones and zeros?
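
A couple of possible solutions, both relying on the coercion of logicals to numbers:

1*temp # arithmetic forces coercion
as.integer(temp) # explicit coercion

[1] 0 0 0 0 1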

AA 2023-24 25 / 453
Introducing R
Missing values

Missing values are ubiquitous in data sets. In R they are denoted with NA. If we want to inspect an object to see if it is missing we may use is.na

z<-c(1:3,NA); ind<-is.na(z);ind

[1] FALSE FALSE FALSE TRUE

Notice that is.xxx is a general structure in R.

AA 2023-24 26 / 453
Introducing R
Missing values

You may inquire about the position of the NA, and you can remove it too

ind2<-which(is.na(z))#;ind2
z[-ind2]

[1] 1 2 3

We will see more about subsetting later on.

AA 2023-24 27 / 453
Introducing R
Missing values

A second concept of missing value is available, and it is related to the result of an operation

0/0

[1] NaN

NaN stands for “not a number”.

AA 2023-24 28 / 453
Introducing R
Missing values

Problem
Notice

a <- Inf-Inf
if(is.nan(a)){print("we do not have a number here!")}

[1] "we do not have a number here!"

if(is.na(a)){print("it is not available either!")}

[1] "it is not available either!"

What is the difference then?
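
A sketch of the asymmetry: every NaN is also treated as missing, but a plain NA is not a NaN.

is.na(NaN) # TRUE
is.nan(NA) # FALSE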

AA 2023-24 29 / 453
Introducing R
Character vectors

Character vectors are used very often, in plot labels for example. We write them as text between quotes: "x-values" or "New iteration results". Let us see an interesting example

labs <- paste(c("X","Y"), 1:10, sep="")
labs

[1] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8"
[9] "X9" "Y10"

Notice that this configuration of the function paste produced the (equivalent) vector through recycling

c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")

[1] "X1" "Y2" "X3" "Y4" "X5" "Y6" "X7" "Y8"
[9] "X9" "Y10"
AA 2023-24 30 / 453
Introducing R
Character vectors

Problem
Try changing the terms of the problem in the generation of labs. Check
what happens if you add a further variable "Z".
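
For reference, a sketch of what happens with a third label: the shorter vector is recycled, with no warning.

labs3 <- paste(c("X","Y","Z"), 1:10, sep="")
labs3

[1] "X1" "Y2" "Z3" "X4" "Y5" "Z6" "X7" "Y8" "Z9" "X10"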

AA 2023-24 31 / 453
Introducing R
Modifying vectors

We will explore now some ways to subset a vector. Suppose we want to get
rid of NAs

y <- x[!is.na(x)];y

[1] 10.4 5.6 3.1 6.4 21.7

If x has missing values then it is longer than y

length(x)>length(y)

[1] FALSE

AA 2023-24 32 / 453
Introducing R
Modifying vectors

There are many other ways to do what we saw in the previous slide. You
may want to explore the following situations

# generate a vector x containing 20 elements, then


x[1:10]
# what has happened? What about
x[-c(3:6)]

AA 2023-24 33 / 453
Introducing R
Modifying vectors

Problem
What about the following example? What do you observe?

fruit <- c(5, 10, 1, 20)
names(fruit) <- c("orange", "banana", "apple", "peach")
lunch <- fruit[c("apple","orange")]

Have you noticed the role played by [] in the examples seen so far?

AA 2023-24 34 / 453
Introducing R
Other objects

Despite the prominence of vectors there are other objects that are
extremely useful
1 matrices (or arrays)
2 factors to handle categorical data
3 lists, a generalization of the concept of vector that can contain objects
of not necessarily the same type or size, often used to store the output
of computations
4 data frames, like matrices but can store numerical and categorical
variables
5 functions, which are a story apart.

AA 2023-24 35 / 453
Introducing R
Objects: modes and attributes

The things or entities that live in R are called objects. Such objects have
special features.

For example, vector objects may contain numeric, logical or string values.
If the components are all of the same type we refer to that as an atomic
structure.

AA 2023-24 36 / 453
Introducing R
Objects: modes and attributes

Let us check the following examples

a1<-c(1,2,3);mode(a1)

[1] "numeric"

a2<-c(1,2,"yeah");mode(a2)

[1] "character"

a3<-c(1,2,NA);mode(a3)

[1] "numeric"

Through mode we learn what type of objects we are working with.

AA 2023-24 37 / 453
Introducing R
Objects: modes and attributes

Problem (typeof)
Replicate the above examples using typeof.
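
For reference, typeof is finer grained than mode:

typeof(c(1,2,3)) # "double", while mode says "numeric"
typeof(1:3) # "integer", mode again says "numeric"
typeof(c(1,2,"yeah")) # "character"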

AA 2023-24 38 / 453
Introducing R
Objects: modes and attributes

We may force an object to be something else

z <- 0:9
digits <- as.character(z)
d <- as.integer(digits)
z;digits;d # incidentally, z and d look the same

[1] 0 1 2 3 4 5 6 7 8 9
[1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
[1] 0 1 2 3 4 5 6 7 8 9

As in the is.xxx case, the methods of coercion are defined as as.xxx.

AA 2023-24 39 / 453
Introducing R
Changing the length of an object

It is possible to intervene on the length of an object and change its elements.

e <- numeric() # empty object
e

numeric(0)

Let’s give it a value somewhere

e[3] <- 17
e

[1] NA NA 17

AA 2023-24 40 / 453
Introducing R
The class of an object

All objects in R are assigned to a class.

a1 <- 1:3; class(a1)

[1] "integer"

a2 <- FALSE; class(a2)

[1] "logical"

a3<-matrix(rnorm(10),5,2);class(a3)

[1] "matrix" "array"

Other classes exist.

AA 2023-24 41 / 453
Introducing R
Ordered and unordered factors

A factor is a vector object used to specify a discrete classification of the components of other vectors of the same length.

R provides both ordered and unordered factors.

AA 2023-24 42 / 453
Introducing R
Ordered and unordered factors

Consider the following very fake example

state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa",
"wa", "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
"sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
"sa", "act", "nsw", "vic", "vic", "act")

Let’s say that this is a vector of 30 tax accountants from all Australian
states.

AA 2023-24 43 / 453
Introducing R
Ordered and unordered factors

We can easily get the unique objects in the vector state

statef <- factor(state);statef;levels(statef)

[1] tas sa qld nsw nsw nt wa wa qld vic nsw vic
[13] qld qld sa tas sa nt wa vic qld nsw nsw wa
[25] sa act nsw vic vic act
Levels: act nsw nt qld sa tas vic wa
[1] "act" "nsw" "nt" "qld" "sa" "tas" "vic" "wa"

Also

unique(state)

[1] "tas" "sa" "qld" "nsw" "nt" "wa" "vic" "act"

AA 2023-24 44 / 453
Introducing R
Ordered and unordered factors

Suppose we have the incomes of the tax accountants in another vector and we want to perform some operation on them.

incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42,
56, 61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
59, 46, 58, 43)

We want to know the mean per state and we use the function tapply

# check
?tapply

AA 2023-24 45 / 453
Introducing R
Ordered and unordered factors

Suppose we have the incomes of the tax accountants in another vector and we want to perform some operation on them.

incmeans <- tapply(incomes, statef, mean)
incmeans

 act  nsw   nt  qld   sa  tas  vic   wa
44.5 57.3 55.5 53.6 55.0 60.5 56.0 52.2

AA 2023-24 46 / 453
Introducing R
Ordered and unordered factors

Let us do some explorations

uni<-unique(state) # unique values inside state


n.per.state<-c() # an empty vector

Let’s fill the empty vector.

AA 2023-24 47 / 453
Introducing R
Ordered and unordered factors

for(i in 1:length(uni)){
n.per.state[i]<- sum(1*(state==uni[i]))
}
names(n.per.state) <- uni
n.per.state # accountants per state

tas  sa qld nsw  nt  wa vic act
  2   4   5   6   2   4   5   2

This is the number of accountants per state, so our array is irregular, or ragged. We applied the mean operator to each factor level as if the levels were separate vectors.

AA 2023-24 48 / 453
Introducing R
Ordered and unordered factors

Problem (The same result but shorter)
Check the output of the following lines of code

n.per.state <- table(statef)
n.per.state <- tapply(statef, statef, length)

AA 2023-24 49 / 453
Introducing R
Ordered and unordered factors

Suppose we want to apply a more complicated function that is not available by default in R, say, the standard error. Then,

stdError <- function(x) sqrt(var(x)/length(x))

We use it with tapply

incster <- tapply(incomes, statef, stdError)
incster

 act  nsw   nt  qld   sa  tas  vic   wa
1.50 4.31 4.50 4.11 2.74 0.50 5.24 2.66

AA 2023-24 50 / 453
Introducing R
Ordered and unordered factors

Problem
Devise a function that computes the t test and replicate the analysis we
have carried out in the previous slide.
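
A minimal sketch of one possible solution, testing the null hypothesis that the mean income equals some value mu (mu = 50 here is an arbitrary choice) in each state, reusing stdError:

t.stat <- function(x, mu = 50) (mean(x) - mu)/stdError(x)
inct <- tapply(incomes, statef, t.stat)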

AA 2023-24 51 / 453
Introducing R
Ordered and unordered factors

In R there is little difference between ordered and unordered factors: both can be handled by the function factor, though you may also check the function ordered.

AA 2023-24 52 / 453
Introducing R
Arrays and matrices

An array can be considered as a multiply subscripted collection of data entries, for example numeric ones. The type of array we are most interested in is the matrix.

Suppose that a has dimensions c(3,4,2); this is a 3 × 4 × 2 array with 24 entries

a[2,,]

is a 4 × 2 array (or matrix). Similar subsetting operations can be performed along the other dimensions. Clearly,

a[,,]

is the full array.

AA 2023-24 53 / 453
Introducing R
Arrays and matrices

Problem
Consider the following array

x <- array(1:20, dim=c(4,5)) # a 4 by 5 array
x

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

How can we extract the elements x[1,3], x[2,2], x[3,1] and replace
them with zeros?

AA 2023-24 54 / 453
Introducing R
Arrays and matrices

Define another array

i <- array(c(1:3,3:1), dim=c(3,2)) # a 3 by 2 index array
i

[,1] [,2]
[1,] 1 3
[2,] 2 2
[3,] 3 1

AA 2023-24 55 / 453
Introducing R
Arrays and matrices

Notice that

x[i]

[1] 9 6 3

AA 2023-24 56 / 453
Introducing R
Arrays and matrices

So we can just replace

x[i]<-0
x

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    0   13   17
[2,]    2    0   10   14   18
[3,]    0    7   11   15   19
[4,]    4    8   12   16   20

AA 2023-24 57 / 453
Introducing R
Arrays and matrices

For further details

?array

AA 2023-24 58 / 453
Introducing R
Arrays and matrices

An important operation for arrays is the outer product. Given two arrays
the outer product is

ab <- a %o% b
# alternatively
ab <- outer(a, b, "*")

If a and b are two numeric arrays, their outer product is an array whose
dimension vector is obtained by concatenating their two dimension vectors,
and whose data vector is got by forming all possible products of elements
of the data vector of a with those of b.

AA 2023-24 59 / 453
Introducing R
Arrays and matrices

It sounds very complicated, let’s have a look at an example

a<-matrix(rnorm(6),3,2)
b<-c(1,2)
ab <- a %o% b;dim(ab)

[1] 3 2 2

AA 2023-24 60 / 453
Introducing R
Arrays and matrices

ab

, , 1

[,1] [,2]
[1,] 0.036 2.512
[2,] -0.774 -0.681
[3,] -0.279 -0.814

, , 2

[,1] [,2]
[1,] 0.0721 5.02
[2,] -1.5473 -1.36
[3,] -0.5571 -1.63
AA 2023-24 61 / 453
Introducing R
Arrays and matrices

We conclude the discussion on arrays and move to something that we will use far more often, i.e. matrices.

We can perform matrix operations similarly to the scalar case, given the
caveats that come from conformability and lack of commutativity for
products.

We can multiply matrices elementwise (*) or row-by-column (%*%); summation and subtraction are standard (+,-); division is a bit more intricate.

AA 2023-24 62 / 453
Introducing R
Arrays and matrices

Matrix transposition is easily achieved via t(X).

Notice that the operation t(X)%*%y can be more efficiently performed with
crossprod(X,y).

The meaning of diag depends on the argument: if v is a vector, diag(v) is a diagonal matrix; if M is a matrix, diag(M) extracts its diagonal elements.

AA 2023-24 63 / 453
Introducing R
Arrays and matrices

Problem (Ik )
Let k be a number. What happens if you compute diag(k)?
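
For reference: with a scalar argument, diag(k) returns the k × k identity matrix.

diag(3)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1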

AA 2023-24 64 / 453
Introducing R
Arrays and matrices

Matrix inversion is fundamental to solving systems of equations, and in our context we are interested in least squares estimation and its generalizations.

If A and b are known and we want to find x we can type solve(A,b). Notice that solve(A) produces the inverse of A.

AA 2023-24 65 / 453
Introducing R
Arrays and matrices

We can easily build matrices by combining other matrices or vectors using cbind and rbind. cbind() forms matrices by binding together matrices horizontally, or columnwise, and rbind() vertically, or rowwise.

Notice that a recycling rule applies here. If X1 and X2 are conformable matrices, X<-cbind(1,X1,X2) is still a matrix.

AA 2023-24 66 / 453
Introducing R
Lists and data frames

An R list is an object consisting of an ordered collection of objects known as its components. The list's components may well be heterogeneous.

Lst <- list(name="Fred", wife="Mary", no.children=3,
child.ages=c(4,7,9))

Notice that components are always numbered and we can extract them
using the corresponding number value

Lst[[1]];Lst[[2]];Lst[[3]];Lst[[4]]

[1] "Fred"
[1] "Mary"
[1] 3
[1] 4 7 9

AA 2023-24 67 / 453
Introducing R
Lists and data frames

We can also extract the individual elements of the components of a list

Lst[[1]][1];Lst[[2]][2];Lst[[3]][2];Lst[[4]][2]

[1] "Fred"
[1] NA
[1] NA
[1] 7

AA 2023-24 68 / 453
Introducing R
Lists and data frames

We can also access the components using their corresponding name via the
operator $

Lst$name;Lst$wife;Lst$child.ages[1]

[1] "Fred"
[1] "Mary"
[1] 4

AA 2023-24 69 / 453
Introducing R
Lists and data frames

There is a difference between [] and [[]]

Lst[[1]]

[1] "Fred"

Lst[1]

$name
[1] "Fred"

The former is the first object in the Lst, and if it is a named list the name
is not included. The latter is a sublist of the list Lst consisting of the first
entry only. If it is a named list, the names are transferred to the sublist.

AA 2023-24 70 / 453
Introducing R
Lists and data frames

We can easily build lists

Lst <- list(name_1=object_1, ..., name_m=object_m)

or modify existing ones

Lst[5] <- list(matrix=Mat)

In addition the operator c() can be used to combine lists

list.ABC <- c(list.A, list.B, list.C)

AA 2023-24 71 / 453
Introducing R
Lists and data frames

The data.frame class handles data sets.

The components must be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames.
Matrices, lists, and data frames provide as many variables to the new data frame as they have columns, elements, or variables, respectively.
Numeric vectors, logicals and factors are included as is, and by default character vectors are coerced to be factors, whose levels are the unique values appearing in the vector.
Vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same row size.

AA 2023-24 72 / 453
Introducing R
Lists and data frames

If the restrictions above are met we may use those objects to construct a
data.frame

# generate a factor variable for income class


incomef <- factor(cut(incomes, breaks = 35+10*(0:7)))
# build a data frame
accountants <- data.frame(home=statef,
loot=incomes, shot=incomef)

AA 2023-24 73 / 453
Introducing R
Lists and data frames

head(accountants)

home loot shot


1 tas 60 (55,65]
2 sa 49 (45,55]
3 qld 40 (35,45]
4 nsw 61 (55,65]
5 nsw 64 (55,65]
6 nt 60 (55,65]

AA 2023-24 74 / 453
Introducing R
Lists and data frames

We can also coerce an object to be a data frame with as.data.frame or by importing data using for example read.table.

AA 2023-24 75 / 453
Introducing R
Lists and data frames

There are many ways to read a data frame. We will revise some basic
approaches.
When reading a data frame from an external file you expect it to have a
special form
the first line of the file should have a name for each variable in the
data frame
each additional line of the file has as its first item a row label and the
values for each variable.

AA 2023-24 76 / 453
Introducing R
Lists and data frames

If you have a data set, say, houses.data, you can import it using

HousePrice <- read.table("houses.data")

provided that the path is the right one. It is good practice to look for working examples and to check

?read.table

AA 2023-24 77 / 453
Introducing R
Lists and data frames

R contains packages that contain data sets. In this case you can use the
function data

#install.packages("datasets")
library(datasets)
data(infert)
#head(infert)

AA 2023-24 78 / 453
Introducing R
Lists and data frames

Problem
Could you get data from a URL?
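
A possible answer: yes, the read functions accept a URL in place of a local path. The address below is a placeholder, not a real data set.

mydata <- read.csv("https://example.com/somedata.csv") # hypothetical URL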

AA 2023-24 79 / 453
Introducing R
Lists and data frames

More modern ways to upload data sets from outside sources are provided
by the packages

library(foreign)

and in particular by

library(haven)

The latter is the best tool I have found thus far.

AA 2023-24 80 / 453
Introducing R
Probability distributions

R has the ability to deal with a large number of probability distributions. This is very useful for simulation as well as estimation purposes.

The command is generally based on the name of the distribution. For example, the normal distribution has norm and we can add prefixes to generate quantities of interest. For example

rnorm(1)

[1] -0.858

draws one number from a standard normal.

AA 2023-24 81 / 453
Introducing R
Probability distributions

We can change the arguments of the distribution to sample from a different normal

rnorm(1, mean = 10, sd=5)

[1] 7.06

AA 2023-24 82 / 453
Introducing R
Probability distributions

The prefixes q, p, d indicate the quantile function, the cdf and the pdf respectively

## 2-tailed p-value for t distribution
2*pt(-2.43, df = 13)

[1] 0.0303

## upper 1% point for an F(2, 7) distribution
qf(0.01, 2, 7, lower.tail = FALSE)

[1] 9.55

AA 2023-24 83 / 453
Introducing R
Probability distributions

Problem
What does lower.tail = FALSE mean in the above example?
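
A sketch of the answer: with lower.tail = FALSE the function works with P(X > x) instead of P(X <= x), so the two calls below return the same quantile.

qf(0.01, 2, 7, lower.tail = FALSE)
qf(0.99, 2, 7)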

AA 2023-24 84 / 453
Introducing R
Probability distributions

We have a number of tools to examine the characteristics of a data set. Consider the example

## sample some data from a normal
x<-rnorm(100); hist(x, prob=TRUE)

[Figure: "Histogram of x", with density on the vertical axis]

AA 2023-24 85 / 453
Introducing R
Probability distributions

We can add more detail

## add a kernel density estimate and a rug
hist(x, prob=TRUE);lines(density(x));rug(x)

[Figure: "Histogram of x" with an overlaid density estimate and a rug]

AA 2023-24 86 / 453
Introducing R
Probability distributions

Problem
Repeat the above example by first increasing the sample size and then
using asymmetric distributions. Find a way to incorporate the true
probability density in your graph.
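
A minimal sketch for the last part, overlaying the true standard normal density on the histogram (the larger sample size and the asymmetric distributions are left to you):

x <- rnorm(10000)
hist(x, prob=TRUE)
curve(dnorm(x), add=TRUE, col="red") # the true density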

AA 2023-24 87 / 453
Introducing R
Grouping, loops and conditional execution

In R we can group expressions within braces

{set.seed(1234); n<-100; b<- 4
x<-rnorm(100);eps<-rnorm(100)
y<-x*b+eps
b.hat<-cov(x,y)/var(x)
eval<-c("simulated data for regression")
}

Everything within braces is an expression (and is evaluated), and the object in braces is an expression itself.

AA 2023-24 88 / 453
Introducing R
Grouping, loops and conditional execution

In fact

b.hat

[1] 3.97

AA 2023-24 89 / 453
Introducing R
Grouping, loops and conditional execution

Notice that the value of the whole thing is the last line...this is a bit
confusing...Let’s change the example slightly

{set.seed(1234); n<-100; b<- 4
x<-rnorm(100);eps<-rnorm(100)
y<-x*b+eps
b.hat<-cov(x,y)/var(x)
eval<-c("simulated data for regression")
} -> sim.ols

The whole object is assigned to sim.ols

sim.ols

[1] "simulated data for regression"

AA 2023-24 90 / 453
Introducing R
Grouping, loops and conditional execution

Why do we care?

Many useful operations involve grouped expressions. Some of them are important for us, such as loops and functions.

AA 2023-24 91 / 453
Introducing R
Grouping, loops and conditional execution

We first consider the function ifelse, the statement if and the function
switch

pizza.ingredients<-c("flour","water","oil",
"tomato","mozzarella","pineapple")
ifelse("pineapple" %in% pizza.ingredients,
"THE END IS NIGH!!!",
"You are on the right track")

[1] "THE END IS NIGH!!!"

AA 2023-24 92 / 453
Introducing R
Grouping, loops and conditional execution

The function ifelse takes a condition as its first argument and, depending on the veracity of that condition, returns one value or the other.

The ifelse function is not fundamental, but it is commonly used and worth knowing.

AA 2023-24 93 / 453
Introducing R
Grouping, loops and conditional execution

The if statement is maybe more flexible

if("pineapple" %in% pizza.ingredients){
print("ARRGHHH!")
pizza.ingredients<-pizza.ingredients[-which(
pizza.ingredients=="pineapple")]
}

[1] "ARRGHHH!"

if("pineapple" %in% pizza.ingredients){
print("ARRGHHH!")
} else {print("add gorgonzola and sausage")}

[1] "add gorgonzola and sausage"

AA 2023-24 94 / 453
Introducing R
Grouping, loops and conditional execution

Problem
Prevent Armageddon by replacing pineapple with basil. Also add
gorgonzola and sausage.
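
One possible fix, using the subsetting tools we have already seen:

pizza.ingredients[pizza.ingredients=="pineapple"] <- "basil"
pizza.ingredients <- c(pizza.ingredients, "gorgonzola", "sausage")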

AA 2023-24 95 / 453
Introducing R
Grouping, loops and conditional execution

The switch function returns one of several possible values depending on the input.

a.number<-sample(c(1,2))[1]
switch(a.number,"number one","number two")

[1] "number two"

switch can also evaluate a string expression.

AA 2023-24 96 / 453
Introducing R
Grouping, loops and conditional execution

Check the following example.

require(stats)
centre <- function(x, type) {
switch(type,
mean = mean(x),
median = median(x),
trimmed = mean(x, trim = .1))
}
x <- rcauchy(1000)
centre(x, "mean");centre(x, "median");centre(x, "trimmed")

[1] -0.879
[1] -0.0389
[1] -0.0828

AA 2023-24 97 / 453
Introducing R
Grouping, loops and conditional execution

The for loop is one of the most common methods to run repeated
operations.

Let us consider an example: we want to evaluate how good a t test is in the context of a linear regression.

You may remember that the t test is asymptotically normal or t distributed depending on the assumptions you use.

If we run a two-tail 5% test and we could repeat the experiment many times, our test would reject the (true) null hypothesis 5% of the time.

AA 2023-24 98 / 453
Introducing R
Grouping, loops and conditional execution

Let us see an example

n<-100;reps<- 1000
b2.hat<-t.test<-matrix(NA,reps,1)
for(i in 1:reps){
# simulate the data
x<-rnorm(n);e<-rnorm(n);y<- 1+2*x+e;X<-cbind(1,x)
# estimate the parameters
b.hat<-solve(crossprod(X))%*%crossprod(X,y)
e.hat<- y-X%*%b.hat
var.b<-as.numeric(crossprod(e.hat)/(n-2))*
solve(crossprod(X))
b2.hat[i]<-b.hat[2]
t.test[i]<-(b2.hat[i]-2)/sqrt(var.b[2,2])
}

AA 2023-24 99 / 453
Introducing R
Grouping, loops and conditional execution

t.size<-mean(1*(abs(t.test)>qnorm(0.975)))
t.size # simulated type I error for a 5% nominal size test

[1] 0.053

This seems interesting. So, what does that actually say? And why do we
care?

AA 2023-24 100 / 453


Introducing R
Grouping, loops and conditional execution

Notice that our choices for the design of the data generating process
(DGP) match those of the classical linear regression model (why?).

Our results seem to be consistent with our assumptions.

AA 2023-24 101 / 453


Introducing R
Grouping, loops and conditional execution

Problem
Try to run a similar experiment but this time consider different distributions
for the error term, e.g., a t distribution with 5 degrees of freedom and a
Cauchy distribution. In addition, using the same setup study the properties
of the 95% confidence interval for the same parameter. What do you
observe?
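
A minimal sketch of the modification, assuming the loop of the previous slides: only the error draw changes, and a coverage indicator (stored in a vector covered created before the loop) tracks the confidence interval.

# inside the loop, replace e<-rnorm(n) with, e.g.,
e <- rt(n, df = 5) # or: e <- rcauchy(n)
# and, after computing b.hat and var.b, record whether the
# 95% interval covers the true value 2
se2 <- sqrt(var.b[2,2])
covered[i] <- (b2.hat[i]-qnorm(0.975)*se2 <= 2) &
(2 <= b2.hat[i]+qnorm(0.975)*se2)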

AA 2023-24 102 / 453


Introducing R
Grouping, loops and conditional execution

Other important functions are while and repeat. The former repeatedly executes an expression as long as a logical condition holds. For example

i<-0
while(i<=10){i<-i+1; if(i==10){print("we are done now")}}

[1] "we are done now"

The latter simply repeats an expression and is usually accompanied by the instruction break, which breaks off the loop.
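
For reference, a small repeat example:

i <- 0
repeat {i <- i+1; if(i==10) break}
i

[1] 10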

AA 2023-24 103 / 453


Introducing R
Grouping, loops and conditional execution

Notice that R is a highly vectorized language. This means that you can
often work out your problem by using vector operations instead of loops.
This generally makes operations more time efficient.
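
A small illustration of the point (exact timings will vary by machine):

z <- rnorm(1e6)
s <- 0
system.time(for(zi in z) s <- s+zi) # explicit loop
system.time(sum(z)) # vectorized, much faster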

AA 2023-24 104 / 453


Introducing R
Writing your own functions

One of the most powerful features of R is the possibility of writing functions.

While there are uncountable packages that perform the most diverse array of operations, you may still find writing your own functions useful.

AA 2023-24 105 / 453


Introducing R
Writing your own functions

Let us consider a practical example. Suppose we want to compute the OLS estimator for a given set of data. We want to build a function that takes care of that.

# load the AER package
library(AER)

# load the data set into the workspace
data(CASchools)

The above code retrieves a data set on Californian schools from the
package AER

AA 2023-24 106 / 453


Introducing R
Writing your own functions

# what's in there?
# head(CASchools)
# str(CASchools)
names(CASchools)

 [1] "district"    "school"      "county"
 [4] "grades"      "students"    "teachers"
 [7] "calworks"    "lunch"       "computer"
[10] "expenditure" "income"      "english"
[13] "read"        "math"

AA 2023-24 107 / 453


Introducing R
Writing your own functions

Suppose we are interested in studying the relationship between the student/teacher ratio (x) and the average test score (y). Notice that these variables are not available, so we need to build them

# compute STR and append it to CASchools
CASchools$STR <- CASchools$students/CASchools$teachers

# compute TestScore and append it to CASchools
CASchools$score <- (CASchools$read + CASchools$math)/2

AA 2023-24 108 / 453


Introducing R
Writing your own functions

my.ols<-function(y,x,const){
y<-as.matrix(y)
x<-as.matrix(x)
if(const==TRUE){
X<-cbind(1,x)
}else{X<-x}
b.hat<-solve(t(X)%*%X)%*%(t(X)%*%y)
b.hat
}

AA 2023-24 109 / 453


Introducing R
Writing your own functions

Let’s cross our fingers and hope it works

ols1<-my.ols(y=CASchools$score,x=CASchools$STR,const=TRUE)
ols1

[,1]
[1,] 698.93
[2,] -2.28

This example comes from Introduction to Econometrics with R by C. Hanck, M. Arnold, A. Gerber and M. Schmelzer. You may want to double check whether the results are correct.

AA 2023-24 110 / 453


Introducing R
Writing your own functions

Problem
Integrate the output of my.ols with a t test for the null hypothesis that β1
and β2 are zero and the corresponding p-values.
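
A minimal sketch of one possible solution: the variance formula is the homoskedastic one used in the Monte Carlo example above, the t tests are treated as asymptotically normal, and the function now returns a list.

my.ols2 <- function(y,x,const){
y <- as.matrix(y); x <- as.matrix(x)
if(const==TRUE){X <- cbind(1,x)}else{X <- x}
n <- nrow(X); k <- ncol(X)
b.hat <- solve(crossprod(X))%*%crossprod(X,y)
e.hat <- y-X%*%b.hat
var.b <- as.numeric(crossprod(e.hat)/(n-k))*solve(crossprod(X))
t.stat <- b.hat/sqrt(diag(var.b)) # tests beta_j = 0
p.val <- 2*pnorm(-abs(t.stat))
list(coef=b.hat, t.stat=t.stat, p.value=p.val)
}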

AA 2023-24 111 / 453


Introducing R
Writing your own functions

There are some features that we may find useful when working with functions.
Notice that when we name the arguments, their order does not matter.
my.ols(y=CASchools$score,x=CASchools$STR,const=TRUE)

[,1]
[1,] 698.93
[2,] -2.28

my.ols(x=CASchools$STR,y=CASchools$score,const=TRUE)

[,1]
[1,] 698.93
[2,] -2.28

my.ols(CASchools$score,CASchools$STR,TRUE)

[,1]
[1,] 698.93
[2,] -2.28

Keep in mind the last case.


AA 2023-24 112 / 453
Introducing R
Writing your own functions

Consider the following situation

my.ols(y=CASchools$score,x=CASchools$STR)

Error in my.ols(y = CASchools$score, x = CASchools$STR):
argument "const" is missing, with no default

my.ols(CASchools$score,TRUE,CASchools$STR)

Error in if (const == TRUE) {: the condition has length > 1

AA 2023-24 113 / 453


Introducing R
Writing your own functions

In many cases our function may call other functions for which we may need
to specify its associated arguments. This can be done using ..., for
example

foo <- function(data, graph=TRUE, limit=20, ...) {


[omitted statements]
if (graph)
par(pch="*", ...)
[more omissions]
}

We do not need to specify the arguments for the graphical device unless we
want to.

Note that any assignment within the function is local to the function.
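
A quick illustration of the scoping rule:

f <- function(){u <- 99; u}
f()
exists("u") # FALSE, unless a u already exists in your workspace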

AA 2023-24 114 / 453


Introducing R
Writing your own functions

There is clearly more to the world of functions in R; what we have introduced so far is sufficient to deal with the examples that we will see later on during the course.

AA 2023-24 115 / 453


Introducing R
Simple graphs

We are tackling this aspect only marginally to dedicate more time to the
study of the tidyverse packages.

One of the reasons for R's success is its ability to produce beautiful graphs.

At this point of our journey our graphs will be mostly ugly.

AA 2023-24 116 / 453


Introducing R
Simple graphs

Plotting commands are divided into three basic groups.
1 High-level plotting functions create a new plot on the graphics device, possibly with axes, labels, titles and so on.
2 Low-level plotting functions add more information to an existing plot, such as extra points, lines and labels.
3 Interactive graphics functions allow you to interactively add information to, or extract information from, an existing plot, using a pointing device such as a mouse.
In addition, there exists a list of graphical parameters which can be manipulated to customize your plots.

AA 2023-24 117 / 453


Introducing R
Simple graphs

We will see some practical examples using the plot function.

# Example from Kleiber&Zeileis


library(AER)
data("Journals")
Journals$citeprice <- Journals$price/Journals$citations
attach(Journals)

AA 2023-24 118 / 453


Introducing R
Simple graphs

We will see some practical examples using the plot function.


# Example from Kleiber&Zeileis
plot(log(subs), log(citeprice))
rug(log(subs))
rug(log(citeprice), side = 2)
[Figure: scatter plot of log(citeprice) against log(subs), with rugs on both axes]

AA 2023-24 119 / 453


Introducing R
Simple graphs

We can modify the appearance of our graph by operating on certain graphical parameters.

You can for example change the points into lines by specifying type="l" in the function plot. Further changes may be applied via the function par. Check

?plot
?par

AA 2023-24 120 / 453


Introducing R
Simple graphs

# Example from Kleiber&Zeileis
plot(log(subs) ~ log(citeprice), data = Journals, pch = 20,
col = "blue", ylim = c(0, 8), xlim = c(-7, 4),
main = "Library subscriptions")

[Figure: "Library subscriptions", scatter plot of log(subs) against log(citeprice)]

AA 2023-24 121 / 453


Introducing R
Simple graphs

# Example from Kleiber&Zeileis
plot(log(subs) ~ log(citeprice), data = Journals, pch = 20,
col = "blue", ylim = c(0, 8), xlim = c(-7, 4),
main = "Library subscriptions")
text(-3.798, 5.846, "Econometrica", pos = 2)
[Figure: the same "Library subscriptions" scatter plot with the point for Econometrica labelled]

AA 2023-24 122 / 453


Introducing R
Simple graphs

We can alter the plot by using further functions such as lines, points,
legend, abline.

More specific plots are also available: barplot, boxplot, qqplot, hist.

For a non-exhaustive set of examples

demo("graphics")

AA 2023-24 123 / 453


Introducing R
Simple graphs

On some occasions we may be interested in plotting curves from a prespecified function, such as a density function.

AA 2023-24 124 / 453


Introducing R
Simple graphs

# Example from Kleiber&Zeileis
curve(dnorm, from = -5, to = 5, col = "red", lwd = 3,
main = "Gaussian density")
text(-5, 0.3, expression(f(x) == frac(1, sigma ~~
sqrt(2*pi)) ~~ e^{-frac((x - mu)^2, 2*sigma^2)}), adj = 0)
[Figure: "Gaussian density" curve with the annotation $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$]

AA 2023-24 125 / 453


Introducing R
Optimization

R has various options for optimization.

We can very roughly distinguish between one-parameter optimization and optimization over more than one parameter. For the former we will refer to the function optimize, for the latter to optim.

?optimize
?optim

You can also check the specific optimization task view in R (link) to check
for new developments on the relevant packages.

AA 2023-24 126 / 453


Introducing R
Optimization

Consider the following wobbly function

f <- function(x) sin(x) + sin(2 * x) + cos(3 * x)
curve(f, from = 0, to = 2 * pi)

[Figure: plot of f(x) over [0, 2π]]
AA 2023-24 127 / 453


Introducing R
Optimization

We want to find an optimum. What does that mean?

We are essentially looking at the tippy points of the function...but are we looking for those that point upward or downward?

It depends.

AA 2023-24 128 / 453


Introducing R
Optimization

The function optimize computes the minimum by default. To make it work we need to provide the function itself and an interval where the optimization algorithm searches for the optimum

optimize(f, interval = c(0, 2 * pi))

$minimum
[1] 3.03

$objective
[1] -1.05

AA 2023-24 129 / 453


Introducing R
Optimization

If we are after a maximum

optimize(f, interval = c(0, 2 * pi), maximum = TRUE)

$maximum
[1] 4.06

$objective
[1] 1.1

AA 2023-24 130 / 453


Introducing R
Optimization

Let us see whether we can apply this approach to a more familiar problem, the OLS estimator.

Consider the model
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
The corresponding objective function is
$$Q_n(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
We know that this problem has a closed-form solution, but for the sake of the example we try to solve it numerically.

AA 2023-24 131 / 453


Introducing R
Optimization

Define the objective function

Q<-function(theta){
theta<-as.vector(theta)
y<-as.matrix(y)
x<-as.matrix(x)
X<-cbind(1,x)
e<-y-X%*%theta
Q<-crossprod(e)
}

AA 2023-24 132 / 453


Introducing R
Optimization

Consider the Journals data set we used earlier

attach(Journals)
y<-log(subs); x<-log(citeprice)
ols2<-optim(c(0,0),Q,method = "BFGS");ols2$par

[1] 4.766 -0.533

lm(y~x)$coefficients

(Intercept) x
4.766 -0.533

AA 2023-24 133 / 453


Introducing R
Optimization

Let us now describe a somewhat complicated likelihood problem.

As we all know, maximum likelihood estimators are the result of optimization problems with often complex objective functions.
AA 2023-24 134 / 453


Introducing R
Optimization

Consider a random variable x and suppose it is normally distributed with unknown mean µ and variance σ².

The corresponding loglikelihood function is
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

AA 2023-24 135 / 453


Introducing R
Optimization

The corresponding code could be

gauss.ll<-function(theta,x){
mu<-theta[1]
sig2<-theta[2]
x<-as.matrix(x)
n<-nrow(x)
ll<- -.5*n*log(2*pi)-.5*n*log(sig2)-sum((x-mu)^2)/(2*sig2)
return(-ll)
}

AA 2023-24 136 / 453


Introducing R
Optimization

Let us feed data to our code

set.seed(12345)
x<-rnorm(100,1,2)
gauss.ll.opt<-optim(c(1,4),gauss.ll,x=x)
gauss.ll.opt$par

[1] 1.49 4.92

We may check what happens if we increase the sample size.

AA 2023-24 137 / 453


Introducing R
Optimization

There is another possible way to define the likelihood by means of a built-in function. Notice that
$$\ell(\mu, \sigma^2) = -n\log(\sigma) + \sum_{i=1}^{n} \log\phi\left(\frac{x_i - \mu}{\sigma}\right)$$
where ϕ is the Gaussian density.

AA 2023-24 138 / 453


Introducing R
Optimization

The corresponding code could be

gauss.ll2<-function(theta,x){
mu<-theta[1]
sig<-theta[2]
x<-as.matrix(x)
n<-nrow(x)
z<-(x-mu)/sig
ll<- -n*log(sig)+sum(dnorm(z,log=TRUE))
return(-ll)
}

AA 2023-24 139 / 453


Introducing R
Optimization

set.seed(12345)
x<-rnorm(100,1,2)
gauss.ll.opt2<-optim(c(1,4),gauss.ll2,x=x)
gauss.ll.opt2$par

[1] 1.49 2.22

AA 2023-24 140 / 453


Introducing R
Optimization

We have obtained the estimates for the parameters of our model.

How can we get variances and, consequently, standard errors?

Recall from your statistics classes that in maximum likelihood there is a relationship between the variance and the second derivatives of the likelihood function.

AA 2023-24 141 / 453


Introducing R
Optimization

set.seed(12345)
x<-rnorm(100,1,2)
gauss.ll.opt2<-optim(c(1,4),gauss.ll2,x=x, hessian = TRUE)
gauss.ll.opt2$par

[1] 1.49 2.22

FI<-solve(gauss.ll.opt2$hessian);FI # inverse of the observed information, i.e. the estimated variance

[,1] [,2]
[1,] 4.92e-02 -4.05e-06
[2,] -4.05e-06 2.46e-02

se<-sqrt(diag(FI));se

[1] 0.222 0.157


AA 2023-24 142 / 453
Introducing R
Optimization

Let us consider now a more involved model. Define a generalized Cobb-Douglas function
$$Y_i \exp(\theta Y_i) = \exp(\beta_1) K_i^{\beta_2} L_i^{\beta_3} \exp(\varepsilon_i)$$
where Y is output, K capital and L labour. A log transformation yields
$$\log(Y_i) + \theta Y_i = \beta_1 + \beta_2 \log(K_i) + \beta_3 \log(L_i) + \varepsilon_i$$
This example is from Kleiber and Zeileis, see also Zellner and Ryu (1998) and Greene (2003, Chapter 17).

AA 2023-24 143 / 453


Introducing R
Optimization

Let us assume εi ∼ N(0, σ²). Then the likelihood function is
$$L = \prod_{i=1}^{n} \phi\left(\frac{\varepsilon_i}{\sigma}\right) \frac{1 + \theta Y_i}{Y_i}.$$
Notice that $\partial\varepsilon_i/\partial Y_i = (1 + \theta Y_i)/Y_i$. The loglikelihood is then
$$\ell = \sum_{i=1}^{n} \left(\log(1 + \theta Y_i) - \log(Y_i)\right) + \sum_{i=1}^{n} \log\phi\left(\frac{\varepsilon_i}{\sigma}\right).$$
The problem reduces to estimating the parameter vector (β1, β2, β3, θ, σ²)′ via the maximization of ℓ.

AA 2023-24 144 / 453


Introducing R
Optimization

Consider the Equipment data set on transportation equipment manufacturing from the AER package.

data("Equipment", package = "AER")

AA 2023-24 145 / 453


Introducing R
Optimization

prodf.ll <- function(pars){
betas<-pars[1:3];theta<-pars[4];sig2 <-pars[5]
Y <-with(Equipment, valueadded/firms)
K <-with(Equipment, capital/firms)
L <-with(Equipment, labor/firms)
lhs<- log(Y)+theta*Y
rhs<- betas[1]+betas[2]*log(K)+betas[3]*log(L)
ll<- sum(log(1+theta*Y)-log(Y)+
dnorm(lhs,mean = rhs,sd=sqrt(sig2),log = TRUE))
return(-ll)
}

AA 2023-24 146 / 453


Introducing R
Optimization

We use optim to proceed with the numerical optimization of the loglikelihood. We need good initial values.

init.val<- lm(log(valueadded/firms) ~ log(capital/firms) +
log(labor/firms), data = Equipment)
pars0 <- as.vector(c(coef(init.val), 0,
mean(residuals(init.val)^2)))

AA 2023-24 147 / 453


Introducing R
Optimization

opt.prodf.ll <- optim(pars0, prodf.ll, hessian = TRUE)
opt.prodf.ll$par

[1] 2.9147 0.3500 1.0923 0.1067 0.0427

sqrt(diag(solve(opt.prodf.ll$hessian)))[1:4]

[1] 0.3606 0.0967 0.1408 0.0585

-opt.prodf.ll$value

[1] -8.94

AA 2023-24 148 / 453


Introducing the tidyverse

The goal of this part is to introduce some concepts of modern data science.

The main reference for this section is R for Data Science by Wickham and
Grolemund. This is a great book, clearly written and free.

AA 2023-24 149 / 453


Introducing the tidyverse

A typical data science project includes the following steps
1 Import: take data stored in a file, database, or web and load it into a data frame in R.
2 Tidy: store the data in a way that matches the semantics of the dataset with the way it is stored; when your data is tidy, each column is a variable, and each row is an observation.
  1 Transform: narrowing the sample, modifying the existing variables, computing sample statistics (tidying+transforming=wrangling)
  2 Visualize: self explanatory, helps to understand the data and to refine your research question
  3 Model: apply computational and mathematical tools to your data
3 Communicate: extend your results to others.

AA 2023-24 150 / 453


Introducing the tidyverse

Let us first install a bunch of packages

#install.packages("tidyverse")
library(tidyverse)

We will use other packages for examples

#install.packages(c("nycflights13", "gapminder", "Lahman"))
library(nycflights13)
library(gapminder)
library(Lahman)

AA 2023-24 151 / 453


Introducing the tidyverse
Visualization

As we have done so far we will proceed by example.

We will start with data visualization.

AA 2023-24 152 / 453


Introducing the tidyverse
Visualization

Let us use the car dataset mpg. We want to ask the following question: do
cars with big engines use more fuel than cars with small engines?

Let us see some of the variables

AA 2023-24 153 / 453


Introducing the tidyverse
Visualization

#?mpg
mpg

# A tibble: 234 x 11
manufacturer model displ year cyl trans drv
<chr> <chr> <dbl> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto~ f
2 audi a4 1.8 1999 4 manu~ f
3 audi a4 2 2008 4 manu~ f
4 audi a4 2 2008 4 auto~ f
5 audi a4 2.8 1999 6 auto~ f
6 audi a4 2.8 1999 6 manu~ f
7 audi a4 3.1 2008 6 auto~ f
8 audi a4 quatt~ 1.8 1999 4 manu~ 4
9 audi a4 quatt~ 1.8 1999 4 auto~ 4
10 audi a4 quatt~ 2 2008 4 manu~ 4
# i 224 more rows
# i 4 more variables: cty <int>, hwy <int>, fl <chr>,
# class <chr>

AA 2023-24 154 / 453


Introducing the tidyverse
Visualization

mpg contains observations collected by the US Environmental Protection Agency on 38 models of car.

Among the variables in mpg are:
1 displ, a car's engine size, in litres.
2 hwy, a car's fuel efficiency on the highway, in miles per gallon (mpg).

AA 2023-24 155 / 453


Introducing the tidyverse
Visualization

Let us start with some graphical representation of the data

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

[Figure: scatter plot of hwy against displ]

AA 2023-24 156 / 453


Introducing the tidyverse
Visualization

We created our first ggplot, somehow out of the blue.

What are the ingredients of ggplot?

AA 2023-24 157 / 453


Introducing the tidyverse
Visualization

# an empty plot, not very interesting

ggplot(data = mpg)

AA 2023-24 158 / 453


Introducing the tidyverse
Visualization

To make the plot more informative we have to add more layers; ggplot(data = mpg) only acquires the data.

We used the function geom_point to map points into the graph. This is but one of the possible so-called geom functions.
AA 2023-24 159 / 453


Introducing the tidyverse
Visualization

The general structure is

ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

The aesthetic object aes takes, e.g., y and x as arguments. By specifying our variables we choose what goes on the y axis and what on the x axis.

In general, an aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points.

AA 2023-24 160 / 453


Introducing the tidyverse
Visualization

Let us look again at the plot: at the right hand side there are cars that are
surprisingly efficient.

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

[Figure: the same scatter plot of hwy against displ; note the surprisingly efficient cars at the right hand side]

AA 2023-24 161 / 453


Introducing the tidyverse
Visualization

They may be hybrid cars. Let us look at the variable class and map it to
the aesthetic color.

AA 2023-24 162 / 453


Introducing the tidyverse
Visualization

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
color = class))

[Figure: scatter plot of hwy against displ with points colored by class]

AA 2023-24 163 / 453


Introducing the tidyverse
Visualization

Well, they turn out to be sports cars, not hybrids: they have big engines but small bodies, which improves their efficiency.

Let us try to see if we use another aesthetic: size.

AA 2023-24 164 / 453


Introducing the tidyverse
Visualization

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
size = class))

[Figure: scatter plot of hwy against displ with point size mapped to class]

AA 2023-24 165 / 453


Introducing the tidyverse
Visualization

Alternatively, we could map class to the alpha aesthetic, which controls the transparency of the points, or to the shape aesthetic, which controls the shape of the points.

AA 2023-24 166 / 453


Introducing the tidyverse
Visualization

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
alpha = class))

[Figure: scatter plot with point transparency (alpha) mapped to class]

AA 2023-24 167 / 453


Introducing the tidyverse
Visualization

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
shape = class))

[Figure: scatter plot with point shape mapped to class; the seventh class is dropped]
Warning: The shape palette can deal with a maximum of 6 discrete values
because more than 6 becomes difficult to discriminate; you have 7.
AA 2023-24 168 / 453
Introducing the tidyverse
Visualization

Consider the following

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy),
color = "blue")

[Figure: scatter plot of hwy against displ with all points drawn in blue]
AA 2023-24 169 / 453
Introducing the tidyverse
Visualization

We can change the look of all points at once by setting a property of the geom; this specification goes outside aes.

The change we induced is not very informative though.

AA 2023-24 170 / 453


Introducing the tidyverse
Visualization

Another way to add categorical variables into our plot is to use facets

The function facet_wrap takes as argument a formula, which is introduced by ~

AA 2023-24 171 / 453


Introducing the tidyverse
Visualization

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
[Figure: hwy against displ, faceted by class in two rows (2seater, compact, midsize, minivan, pickup, subcompact, suv)]

AA 2023-24 172 / 453


Introducing the tidyverse
Visualization

Let us now study the properties of the geoms.

geoms are geometrical objects that a plot uses to represent data.

Examples of geoms are bars, lines, points and so on.

Let us see some examples.

AA 2023-24 173 / 453


Introducing the tidyverse
Visualization

# points, as before
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))

[Figure: the familiar scatter plot of hwy against displ]

AA 2023-24 174 / 453


Introducing the tidyverse
Visualization

# now smooth a line!
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))

[Figure: smoothed curve of hwy against displ with a confidence band]

AA 2023-24 175 / 453


Introducing the tidyverse
Visualization

For every geom we may specify some specific aesthetics.

AA 2023-24 176 / 453


Introducing the tidyverse
Visualization

# let's differentiate by drivetrain.
ggplot(data = mpg) + geom_smooth(mapping =
aes(x = displ, y = hwy, linetype = drv))

[Figure: three smoothed curves with line type mapped to drv (4, f, r)]

AA 2023-24 177 / 453


Introducing the tidyverse
Visualization

In case you are wondering ggplot2 provides over 40 geoms.

Some geoms allow you to add more variables to your plot.

AA 2023-24 178 / 453


Introducing the tidyverse
Visualization

# the baseline smooth, without grouping
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))

[Figure: a single smoothed curve of hwy against displ]

AA 2023-24 179 / 453


Introducing the tidyverse
Visualization

# let's differentiate by drivetrain.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

[Figure: separate smooths by drv, all drawn in the same color]

AA 2023-24 180 / 453


Introducing the tidyverse
Visualization

# let's differentiate by drivetrain.
ggplot(data = mpg) + geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE
)

[Figure: smooths colored by drv, with the legend suppressed]

AA 2023-24 181 / 453


Introducing the tidyverse
Visualization

Can we put more geoms together?

AA 2023-24 182 / 453


Introducing the tidyverse
Visualization

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))

[Figure: points and a smoothed curve layered in the same plot]

AA 2023-24 183 / 453


Introducing the tidyverse
Visualization

This is nice, but we see that basically we duplicate code. There is a more
efficient approach.

AA 2023-24 184 / 453


Introducing the tidyverse
Visualization

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() + geom_smooth()

[Figure: the same points-plus-smooth plot, with the mapping shared via ggplot()]

AA 2023-24 185 / 453


Introducing the tidyverse
Visualization

We basically moved the mapping function inside ggplot.

This is nice because it allows us to specify different aesthetics in the two layers of our plot.

AA 2023-24 186 / 453


Introducing the tidyverse
Visualization

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()

[Figure: points colored by class with one overall smooth]

AA 2023-24 187 / 453


Introducing the tidyverse
Visualization

In the layer defined by geom_smooth we may apply the filter function.

AA 2023-24 188 / 453


Introducing the tidyverse
Visualization

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg,
class == "subcompact"), se = TRUE)

[Figure: all points colored by class; the smooth is fitted to subcompact cars only]

AA 2023-24 189 / 453


Introducing the tidyverse
Visualization

Let us look now at another data set, on diamonds. We introduce a new geom.

dim(diamonds)

[1] 53940 10

names(diamonds)

[1] "carat"   "cut"     "color"   "clarity" "depth"
[6] "table"   "price"   "x"       "y"       "z"

AA 2023-24 190 / 453


Introducing the tidyverse
Visualization

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))

[Figure: bar chart of diamond counts by cut (Fair, Good, Very Good, Premium, Ideal)]

AA 2023-24 191 / 453


Introducing the tidyverse
Visualization

Notice that the data set does not contain counts for our diamonds.

Some geoms plot the raw values; others, on the other hand, apply
some transformation to the data to get the end result.

The algorithm used to calculate new values for a graph is called a stat.

This is typical of histograms, pie charts and any other graph that requires a
transformation of the data.

AA 2023-24 192 / 453


Introducing the tidyverse
Visualization

Every geom has an associated stat and vice versa; this makes
the production of a specific plot (say a histogram) automatic.

There are though some situations where you may consider using a stat
explicitly.

AA 2023-24 193 / 453


Introducing the tidyverse
Visualization

1 You do not like the default result and you want to override the default
stat.

# tribble is a function to create data sets


# ...or an alien species in Star Trek
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)

AA 2023-24 194 / 453


Introducing the tidyverse
Visualization

ggplot(data = demo) +
geom_bar(mapping = aes(x = cut, y = freq),
stat = "identity")

[Plot: bar chart of freq by cut from the demo data; note the categories appear in alphabetical order]

AA 2023-24 195 / 453


Introducing the tidyverse
Visualization

Notice that, in a way, the last plot is conceptually more similar to a scatter
plot than a bar plot.

AA 2023-24 196 / 453


Introducing the tidyverse
Visualization

Problem
Then the question for you is: what does identity do?

AA 2023-24 197 / 453


Introducing the tidyverse
Visualization

2 You might want to override the default mapping from transformed


variables to aesthetics. For example, you might want to display a bar
chart of proportion.

AA 2023-24 198 / 453


Introducing the tidyverse
Visualization

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

[Plot: bar chart of prop by cut; every bar has height 1]

AA 2023-24 199 / 453


Introducing the tidyverse
Visualization

There seems to be something wrong.

What ggplot actually does is to take the levels of cut, which are Fair,
Good, Very Good, Premium and Ideal, and calculate proportions for each
bin.

So the proportion of fair diamonds in the Fair bin is one.

AA 2023-24 200 / 453


Introducing the tidyverse
Visualization

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop),
group = "dummy argument"))

[Plot: bar chart of prop by cut; with a common group, bars now show each cut's share of the whole]

AA 2023-24 201 / 453


Introducing the tidyverse
Visualization

Specifying a common value for group makes ggplot compute each cut's
proportion relative to the whole data set rather than within its own bin.

AA 2023-24 202 / 453


Introducing the tidyverse
Visualization

3 You might want to draw greater attention to the statistical


transformation in your code.

AA 2023-24 203 / 453


Introducing the tidyverse
Visualization

For example, you might use stat_summary, which summarizes the y values
for each unique x value, to draw attention to the summary that you are
computing.

AA 2023-24 204 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)

AA 2023-24 205 / 453


Introducing the tidyverse
Visualization

p
[Plot: median depth by cut, with min-max ranges]

AA 2023-24 206 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = function(x) mean(x) - sd(x),
fun.max = function(x) mean(x) + sd(x),
fun = mean
)

AA 2023-24 207 / 453


Introducing the tidyverse
Visualization

p
[Plot: mean depth by cut, with one standard deviation ranges]

AA 2023-24 208 / 453


Introducing the tidyverse
Visualization

One attractive feature of graphs is that we can use colour.

In ggplot we can perform this in multiple ways.

AA 2023-24 209 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))

AA 2023-24 210 / 453


Introducing the tidyverse
Visualization

[Plot: bar chart of counts by cut; bar outlines coloured by cut]

AA 2023-24 211 / 453


Introducing the tidyverse
Visualization

Here we have used the colour aesthetic. Let us check what happens if we
use fill.

AA 2023-24 212 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))

AA 2023-24 213 / 453


Introducing the tidyverse
Visualization

[Plot: bar chart of counts by cut; bars filled by cut]

AA 2023-24 214 / 453


Introducing the tidyverse
Visualization

Notice that the x variable and the fill aesthetic are assigned the same
variable, cut. What happens if we change one of the two?

AA 2023-24 215 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))

AA 2023-24 216 / 453


Introducing the tidyverse
Visualization

[Plot: stacked bar chart of counts by cut, filled by clarity]

AA 2023-24 217 / 453


Introducing the tidyverse
Visualization

The stacking is performed automatically by the position adjustment


specified by the position argument. If you do not want a stacked bar chart,
you can use one of three other options: identity, dodge or fill.

AA 2023-24 218 / 453


Introducing the tidyverse
Visualization

position = "identity" will place each object exactly where it falls in


the context of the graph. We cannot see this very well in the bar chart.

AA 2023-24 219 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = diamonds, mapping = aes(x = cut,


fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")

AA 2023-24 220 / 453


Introducing the tidyverse
Visualization

[Plot: overlapping translucent bars of counts by cut and clarity, position = "identity"]

AA 2023-24 221 / 453


Introducing the tidyverse
Visualization

Let us try to see what happens without filling.

AA 2023-24 222 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = diamonds, mapping = aes(x = cut,


colour = clarity)) +
geom_bar(fill = NA, position = "identity")

AA 2023-24 223 / 453


Introducing the tidyverse
Visualization

[Plot: unfilled bar outlines of counts by cut, coloured by clarity, position = "identity"]

AA 2023-24 224 / 453


Introducing the tidyverse
Visualization

position = "fill" works like stacking, but makes each set of stacked
bars the same height. This makes it easier to compare proportions across
groups.

AA 2023-24 225 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity),
position = "fill")

AA 2023-24 226 / 453


Introducing the tidyverse
Visualization

[Plot: stacked bars of equal height showing clarity proportions within each cut, position = "fill"]

AA 2023-24 227 / 453


Introducing the tidyverse
Visualization

position = "dodge" places overlapping objects directly beside one


another. This makes it easier to compare individual values.

AA 2023-24 228 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity),
position = "dodge")

AA 2023-24 229 / 453


Introducing the tidyverse
Visualization

[Plot: side-by-side bars of counts by clarity within each cut, position = "dodge"]

AA 2023-24 230 / 453


Introducing the tidyverse
Visualization

In some data sets we may have values that coincide. This causes some
points in, say, a scatterplot to overlap.

This overlap (overplotting) may provide wrong insights. One thing that we
can do is to nudge our points with some random noise (jittering).

AA 2023-24 231 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy),
position = "jitter")

AA 2023-24 232 / 453


Introducing the tidyverse
Visualization

[Plot: jittered scatter of hwy against displ]

AA 2023-24 233 / 453


Introducing the tidyverse
Visualization

It is very important to understand how to manipulate the coordinate


system. This is generally a complicated business.

ggplot has a couple of functions that may be used for this specific purpose.

AA 2023-24 234 / 453


Introducing the tidyverse
Visualization

coord_flip switches the x and y axes.

AA 2023-24 235 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +


geom_boxplot()

AA 2023-24 236 / 453


Introducing the tidyverse
Visualization

[Plot: boxplots of hwy by class]

AA 2023-24 237 / 453


Introducing the tidyverse
Visualization

p<-ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +


geom_boxplot()+
coord_flip()

AA 2023-24 238 / 453


Introducing the tidyverse
Visualization

[Plot: the same boxplots with flipped axes: class on the y axis, hwy on the x axis]

AA 2023-24 239 / 453


Introducing the tidyverse
Visualization

coord_quickmap sets the aspect ratio correctly for maps.

AA 2023-24 240 / 453


Introducing the tidyverse
Visualization

library(maps)
nz <- map_data("nz")
p<-ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")

AA 2023-24 241 / 453


Introducing the tidyverse
Visualization

[Plot: map of New Zealand with the default, distorted aspect ratio]

AA 2023-24 242 / 453


Introducing the tidyverse
Visualization

p<-ggplot(nz, aes(long, lat, group = group)) +


geom_polygon(fill = "white", colour = "black") +
coord_quickmap()

AA 2023-24 243 / 453


Introducing the tidyverse
Visualization

[Plot: map of New Zealand with the correct aspect ratio via coord_quickmap]

AA 2023-24 244 / 453


Introducing the tidyverse
Visualization

Problem
coord_polar uses polar coordinates.

See what happens when you apply polar coordinates to our bar plot.

AA 2023-24 245 / 453


Introducing the tidyverse
Visualization

What is the point so far?

We have tried to understand how to build potentially complex plots by


example.

There is, though, a common thread that connects the various examples.
This is called the layered grammar of graphics.

AA 2023-24 246 / 453


Introducing the tidyverse
Visualization

The basic idea is that a graph is a template onto which we can


superimpose various layers of information

ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>

AA 2023-24 247 / 453


Introducing the tidyverse
Visualization

We have to feed seven parameters (the quantities within < >)

ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>

AA 2023-24 248 / 453


Introducing the tidyverse
Data transformation

Visualization is very informative; most of the time, though, we need to
modify the data in some way.

AA 2023-24 249 / 453


Introducing the tidyverse
Data transformation

We will consider a data set containing all 336,776 flights that departed
from New York City in 2013. We will do our manipulations with the
package dplyr (already in the tidyverse).

library(nycflights13)

AA 2023-24 250 / 453


Introducing the tidyverse
Data transformation

flights

# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 2
2 2013 1 1 533 529 4
3 2013 1 1 542 540 2
4 2013 1 1 544 545 -1
5 2013 1 1 554 600 -6
6 2013 1 1 554 558 -4
7 2013 1 1 555 600 -5
8 2013 1 1 557 600 -3
9 2013 1 1 557 600 -3
10 2013 1 1 558 600 -2
# i 336,766 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>

AA 2023-24 251 / 453


Introducing the tidyverse
Data transformation

If you have worked with R already you will notice some differences with
other data frame structures.

That is in fact a tibble, which is a fancy data frame.

AA 2023-24 252 / 453


Introducing the tidyverse
Data transformation

This data set provides us with some information about the variables, the
size of the data set and the type of variables we are dealing with.

AA 2023-24 253 / 453


Introducing the tidyverse
Data transformation

In particular
1 int stands for integers.
2 dbl stands for doubles.
3 chr stands for character vectors or strings.
4 dttm stands for date-times (a date + a time).
5 lgl stands for logical, vectors that contain only TRUE or FALSE.
6 fctr stands for factors, which R uses to represent categorical
variables with fixed possible values.
7 date stands for dates.
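As a quick illustration (a minimal sketch of ours, not part of the flights data), we can build a tiny tibble and read the type abbreviations off the printed header:

library(tibble) # also attached by the tidyverse
toy <- tibble(
n = 1:3, # prints as <int>
x = c(1.5, 2.3, 3.1), # prints as <dbl>
name = c("a", "b", "c"), # prints as <chr>
flag = c(TRUE, FALSE, TRUE), # prints as <lgl>
when = as.Date("2013-01-01") + 0:2 # prints as <date>
)
toy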

AA 2023-24 254 / 453


Introducing the tidyverse
Data transformation

To intervene on the data set we will use the following verbs.


1 filter: pick observations by their values.
2 arrange: reorder the rows.
3 select: pick variables by their names.
4 mutate: create new variables with functions of existing variables.

AA 2023-24 255 / 453


Introducing the tidyverse
Data transformation

The logic of the verbs is similar


1 the first argument is a data frame
2 the subsequent arguments describe what to do with the data frame
3 the result is a new data frame of existing variables.

AA 2023-24 256 / 453


Introducing the tidyverse
Data transformation

Let us, as usual, see some examples.

AA 2023-24 257 / 453


Introducing the tidyverse
Data transformation

(dec25 <- filter(flights, month == 12, day == 25))

# A tibble: 719 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 12 25 456 500 -4
2 2013 12 25 524 515 9
3 2013 12 25 542 540 2
4 2013 12 25 546 550 -4
5 2013 12 25 556 600 -4
6 2013 12 25 557 600 -3
7 2013 12 25 557 600 -3
8 2013 12 25 559 600 -1
9 2013 12 25 559 600 -1
10 2013 12 25 600 600 0
# i 709 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>

AA 2023-24 258 / 453


Introducing the tidyverse
Data transformation

Notice that comparisons are carried out using the usual symbols: >, >=, <,
<=, !=, ==

We can also use Booleans & and |.

Be careful when comparing numbers with the result of an operation

sqrt(2) ^ 2 == 2

[1] FALSE

near(sqrt(2) ^ 2, 2)

[1] TRUE

AA 2023-24 259 / 453


Introducing the tidyverse
Data transformation

Let us see some practical examples.


(nov_dec<-filter(flights, month == 11 | month == 12))

# A tibble: 55,403 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 11 1 5 2359 6
2 2013 11 1 35 2250 105
3 2013 11 1 455 500 -5
4 2013 11 1 539 545 -6
5 2013 11 1 542 545 -3
6 2013 11 1 549 600 -11
7 2013 11 1 550 600 -10
8 2013 11 1 554 600 -6
9 2013 11 1 554 600 -6
10 2013 11 1 554 600 -6
# i 55,393 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>

AA 2023-24 260 / 453


Introducing the tidyverse
Data transformation

We have to be careful, though: in the next command (11 | 12) evaluates to
TRUE, which is coerced to 1, so we end up keeping January flights.


(nov_dec<-filter(flights, month == (11 | 12)))

# A tibble: 27,004 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 1 517 515 2
2 2013 1 1 533 529 4
3 2013 1 1 542 540 2
4 2013 1 1 544 545 -1
5 2013 1 1 554 600 -6
6 2013 1 1 554 558 -4
7 2013 1 1 555 600 -5
8 2013 1 1 557 600 -3
9 2013 1 1 557 600 -3
10 2013 1 1 558 600 -2
# i 26,994 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>

AA 2023-24 261 / 453


Introducing the tidyverse
Data transformation

As usual we may have multiple alternatives.


(nov_dec <- filter(flights, month %in% c(11, 12)))

# A tibble: 55,403 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 11 1 5 2359 6
2 2013 11 1 35 2250 105
3 2013 11 1 455 500 -5
4 2013 11 1 539 545 -6
5 2013 11 1 542 545 -3
6 2013 11 1 549 600 -11
7 2013 11 1 550 600 -10
8 2013 11 1 554 600 -6
9 2013 11 1 554 600 -6
10 2013 11 1 554 600 -6
# i 55,393 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>

AA 2023-24 262 / 453


Introducing the tidyverse
Data transformation

Suppose now you have to deal with a complicated logical statement:

“Select the flights that do not have an arrival delay (A) greater than 120
nor a departure delay (B) greater than 120.”

In logical terms this is

¬(A ∪ B) = ¬A ∩ ¬B

The equality follows from De Morgan's laws.

AA 2023-24 263 / 453


Introducing the tidyverse
Data transformation

As usual we may have multiple alternatives.

fl1<-filter(flights, !(arr_delay > 120 | dep_delay > 120))


fl2<-filter(flights, arr_delay <= 120, dep_delay <= 120)
identical(fl1,fl2)

[1] TRUE

AA 2023-24 264 / 453


Introducing the tidyverse
Data transformation

arrange operates on the rows.


# desc reorders a column in descending order
arrange(flights, desc(dep_delay))

# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay
<int> <int> <int> <int> <int> <dbl>
1 2013 1 9 641 900 1301
2 2013 6 15 1432 1935 1137
3 2013 1 10 1121 1635 1126
4 2013 9 20 1139 1845 1014
5 2013 7 22 845 1600 1005
6 2013 4 10 1100 1900 960
7 2013 3 17 2321 810 911
8 2013 6 27 959 1900 899
9 2013 7 22 2257 759 898
10 2013 12 5 756 1700 896
# i 336,766 more rows
# i 13 more variables: arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>

AA 2023-24 265 / 453


Introducing the tidyverse
Data transformation

The verb select allows us to subset on the variables. This is particularly


useful when we deal with large data sets.

AA 2023-24 266 / 453


Introducing the tidyverse
Data transformation

# provide data set, choose three variables


select(flights, year, month, day)

# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows

AA 2023-24 267 / 453


Introducing the tidyverse
Data transformation

# provide data set, choose all variables


# between year and day
select(flights, year:day)

# A tibble: 336,776 x 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# i 336,766 more rows

AA 2023-24 268 / 453


Introducing the tidyverse
Data transformation

# provide data set, choose all variables


# excluding what is between year and day
select(flights, -(year:day))

# A tibble: 336,776 x 16
dep_time sched_dep_time dep_delay arr_time
<int> <int> <dbl> <int>
1 517 515 2 830
2 533 529 4 850
3 542 540 2 923
4 544 545 -1 1004
5 554 600 -6 812
6 554 558 -4 740
7 555 600 -5 913
8 557 600 -3 709
9 557 600 -3 838
10 558 600 -2 753
# i 336,766 more rows
# i 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>

AA 2023-24 269 / 453


Introducing the tidyverse
Data transformation

When dealing with large data sets, we may not know exactly the name of
the variables of interest, or we may just know a part of them.
In this case we can use some helper functions
1 starts_with("abc"): matches names that begin with “abc”
2 ends_with("xyz"): matches names that end with “xyz”
3 contains("ijk"): matches names that contain “ijk”
4 num_range("x",1:3): matches x1, x2, x3
Other functions can be found by checking select.

AA 2023-24 270 / 453


Introducing the tidyverse
Data transformation

# variables that end with "ime" or variables


# that contain "ay"
select(flights, ends_with("ime"),contains("ay"))

# A tibble: 336,776 x 8
dep_time sched_dep_time arr_time sched_arr_time
<int> <int> <int> <int>
1 517 515 830 819
2 533 529 850 830
3 542 540 923 850
4 544 545 1004 1022
5 554 600 812 837
6 554 558 740 728
7 555 600 913 854
8 557 600 709 723
9 557 600 838 846
10 558 600 753 745
# i 336,766 more rows
# i 4 more variables: air_time <dbl>, day <int>,
# dep_delay <dbl>, arr_delay <dbl>

AA 2023-24 271 / 453


Introducing the tidyverse
Data transformation

# variables must both end with "ime"


# and contain "ay"
select(flights, ends_with("ime") & contains("ay"))

# A tibble: 336,776 x 0

AA 2023-24 272 / 453


Introducing the tidyverse
Data transformation

The verb mutate is arguably one of the most useful as it allows us to operate
on the existing variables to create new ones. The new variables are
automatically included in the data set.

With mutate we can use standard arithmetic operations (+, -, *, /) and


apply well known functions (log, exp, mean,...).

Later we will see an intriguing example involving modular arithmetic.

AA 2023-24 273 / 453


Introducing the tidyverse
Data transformation

(flights_sml <- select(flights,


year:day,
ends_with("delay"),
distance,
air_time
))

# A tibble: 336,776 x 7
year month day dep_delay arr_delay distance
<int> <int> <int> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400
2 2013 1 1 4 20 1416
3 2013 1 1 2 33 1089
4 2013 1 1 -1 -18 1576
5 2013 1 1 -6 -25 762
6 2013 1 1 -4 12 719
7 2013 1 1 -5 19 1065
8 2013 1 1 -3 -14 229
9 2013 1 1 -3 -8 944
10 2013 1 1 -2 8 733
# i 336,766 more rows
# i 1 more variable: air_time <dbl>

AA 2023-24 274 / 453


Introducing the tidyverse
Data transformation

mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)

# A tibble: 336,776 x 9
year month day dep_delay arr_delay distance
<int> <int> <int> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400
2 2013 1 1 4 20 1416
3 2013 1 1 2 33 1089
4 2013 1 1 -1 -18 1576
5 2013 1 1 -6 -25 762
6 2013 1 1 -4 12 719
7 2013 1 1 -5 19 1065
8 2013 1 1 -3 -14 229
9 2013 1 1 -3 -8 944
10 2013 1 1 -2 8 733
# i 336,766 more rows
# i 3 more variables: air_time <dbl>, gain <dbl>,
# speed <dbl>

AA 2023-24 275 / 453


Introducing the tidyverse
Data transformation

If you want to keep only the new variables use transmute.

AA 2023-24 276 / 453


Introducing the tidyverse
Data transformation

transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)

# A tibble: 336,776 x 3
gain hours gain_per_hour
<dbl> <dbl> <dbl>
1 -9 3.78 -2.38
2 -16 3.78 -4.23
3 -31 2.67 -11.6
4 17 3.05 5.57
5 19 1.93 9.83
6 -16 2.5 -6.4
7 -24 2.63 -9.11
8 11 0.883 12.5
9 5 2.33 2.14
10 -10 2.3 -4.35
# i 336,766 more rows

AA 2023-24 277 / 453


Introducing the tidyverse
Data transformation

The operator %/% stands for integer division and %% returns the remainder.

It is understood that for any two numbers x and y the following


relationship holds

x==y*(x%/%y)+(x%%y)
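
A quick numeric check of this identity, with numbers chosen purely for illustration:

x<-17; y<-5
x %/% y # integer division: 3
x %% y # remainder: 2
x == y*(x %/% y) + (x %% y) # TRUE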

AA 2023-24 278 / 453


Introducing the tidyverse
Data transformation

transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)

# A tibble: 336,776 x 3
dep_time hour minute
<int> <dbl> <dbl>
1 517 5 17
2 533 5 33
3 542 5 42
4 544 5 44
5 554 5 54
6 554 5 54
7 555 5 55
8 557 5 57
9 557 5 57
10 558 5 58
# i 336,766 more rows

AA 2023-24 279 / 453


Introducing the tidyverse
Data transformation

The last verb is summarise which is generally used in combination with


group_by.

group_by splits the data into groups.

Once the data is grouped we can invoke summarise to collapse each group
into a single row summary. This generally happens via some summarizing
function.

AA 2023-24 280 / 453


Introducing the tidyverse
Data transformation

# Not a very informative example


summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

# A tibble: 1 x 1
delay
<dbl>
1 12.6

AA 2023-24 281 / 453


Introducing the tidyverse
Data transformation

# A more informative example


by_carrier <- group_by(flights, carrier)
summarise(by_carrier, delay = mean(dep_delay, na.rm = TRUE))

# A tibble: 16 x 2
carrier delay
<chr> <dbl>
1 9E 16.7
2 AA 8.59
3 AS 5.80
4 B6 13.0
5 DL 9.26
6 EV 20.0
7 F9 20.2
8 FL 18.7
9 HA 4.90
10 MQ 10.6
11 OO 12.6
12 UA 12.1
13 US 3.78
14 VX 12.9
15 WN 17.7
16 YV 19.0

AA 2023-24 282 / 453


Introducing the tidyverse
Data transformation

Tidying a data set often takes multiple operations, which it is convenient to
group into a minimal number of steps.

Let us suppose we want to investigate the relationship between the


distance and average delay for each location.

AA 2023-24 283 / 453


Introducing the tidyverse
Data transformation

by_dest <- group_by(flights, dest)


delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")

AA 2023-24 284 / 453


Introducing the tidyverse
Data transformation

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +


geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)

[Plot: average delay against distance; point size proportional to count, with smoothed line]

AA 2023-24 285 / 453


Introducing the tidyverse
Data transformation

Let us break down what we just did.


we grouped flights by destination
we used summarise to compute distance, average delay, and number
of flights
Honolulu is far away from everything so we removed it

AA 2023-24 286 / 453


Introducing the tidyverse
Data transformation

We can do the same thing in a faster way using the pipe %>%.

The pipe is contained in the package magrittr (pun intended).

AA 2023-24 287 / 453


Introducing the tidyverse
Data transformation

delays <- flights %>%


group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")

AA 2023-24 288 / 453


Introducing the tidyverse
Data transformation

ggplot(data = delays, mapping = aes(x = dist, y = delay)) +


geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)

[Plot: average delay against distance; point size proportional to count, with smoothed line]

AA 2023-24 289 / 453


Introducing the tidyverse
Data transformation

What does the pipe do exactly? Basically, it passes an object (e.g. a data
frame) to the first argument of a function.

You can read the instructions as a series of imperative statements: group,


then summarise, then filter. Notice: %>% = then.

What we have is that x %>% f(y) turns into f(x, y), x %>% f(y) %>%
g(z) turns into g(f(x, y), z), and so on.
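
To see the equivalence concretely, here is a small check of ours with verbs we already know; both lines return the same grouped summary.

flights %>% group_by(carrier) %>% summarise(n = n())
summarise(group_by(flights, carrier), n = n())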

AA 2023-24 290 / 453


Introducing the tidyverse
Exploratory data analysis

Exploratory data analysis (EDA) is a specific task that consists of the


following steps
1 generate questions about the data
2 search for answers by visualizing, transforming and modeling the data
3 use what you learn to refine your questions and/or generate new
questions

AA 2023-24 291 / 453


Introducing the tidyverse
Exploratory data analysis

The goal during EDA is to develop an understanding of your data.

The idea that lurks behind this process is to take qualitative questions and
turn them into quantitative answers.

If you want you can see this as a deeply creative process and maybe even a
form of art.

AA 2023-24 292 / 453


Introducing the tidyverse
Exploratory data analysis

Despite the gazillion of possibilities to start EDA, there exist some fixed
points. You should ask
What type of variation occurs within my variables?
What type of covariation occurs between my variables?

AA 2023-24 293 / 453


Introducing the tidyverse
Exploratory data analysis

Variation is the property of a variable to change from measurement to


measurement.

Plots are often the best way to understand this feature.

The way we visualize a variable depends on some basic properties, in


particular whether such variable is categorical or continuous.

AA 2023-24 294 / 453


Introducing the tidyverse
Exploratory data analysis

# bar charts work well with categorical data


ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))

[Plot: bar chart of diamond counts by cut]

AA 2023-24 295 / 453


Introducing the tidyverse
Exploratory data analysis

# compute the height of each bar


diamonds %>% count(cut)

# A tibble: 5 x 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551

AA 2023-24 296 / 453


Introducing the tidyverse
Exploratory data analysis

# histograms provide good representations of


# continuous variables
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

[Plot: histogram of carat with binwidth 0.5]

AA 2023-24 297 / 453


Introducing the tidyverse
Exploratory data analysis

diamonds %>%
count(cut_width(carat, 0.5))

# A tibble: 11 x 2
`cut_width(carat, 0.5)` n
<fct> <int>
1 [-0.25,0.25] 785
2 (0.25,0.75] 29498
3 (0.75,1.25] 15977
4 (1.25,1.75] 5313
5 (1.75,2.25] 2002
6 (2.25,2.75] 322
7 (2.75,3.25] 32
8 (3.25,3.75] 5
9 (3.75,4.25] 4
10 (4.25,4.75] 1
AA 2023-24 298 / 453
Introducing the tidyverse
Exploratory data analysis

A histogram basically divides the x-axis into equally spaced segments or bins
and counts how many data points fall in each bin.

The width of the bin is called the binwidth. Different binwidths may reveal
different patterns.

AA 2023-24 299 / 453


Introducing the tidyverse
Exploratory data analysis

smaller <- diamonds %>% filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat)) +


geom_histogram(binwidth = 0.1)

[Plot: histogram of carat for diamonds smaller than 3 carats, binwidth 0.1]

AA 2023-24 300 / 453


Introducing the tidyverse
Exploratory data analysis

In a histogram (and also in a bar chart) higher peaks denote higher
frequencies of occurrence of a given value.

This knowledge may be used to ask more questions. Which are the most
common (rare) values and why? Is this what we expected? Can we see
some special pattern and why?

AA 2023-24 301 / 453


Introducing the tidyverse
Exploratory data analysis

smaller <- diamonds %>% filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat)) +


geom_histogram(binwidth = 0.01)

[Plot: histogram of carat with binwidth 0.01; clusters of common values are visible]

AA 2023-24 302 / 453


Introducing the tidyverse
Exploratory data analysis

Problem
The above picture suggests the existence of clusters. Investigate why such
clusters appear.

AA 2023-24 303 / 453


Introducing the tidyverse
Exploratory data analysis

Outliers may be just data entry errors or may reveal some deeper properties
of the data generating process.

AA 2023-24 304 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)

[Plot: histogram of y with binwidth 0.5; a long right tail]

AA 2023-24 305 / 453


Introducing the tidyverse
Exploratory data analysis

When we have loads of data, outliers may be difficult to see, even though the
histogram provides a clue (the mass is concentrated on the left, with a long
right tail). Let us zoom in.

AA 2023-24 306 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))

[Plot: the same histogram with the y axis zoomed to counts between 0 and 50; rare outlying values of y become visible]

AA 2023-24 307 / 453


Introducing the tidyverse
Exploratory data analysis

We see that there are diamonds of size zero (?). They are probably
mistakes. We also see that there are a couple of big ones.

AA 2023-24 308 / 453


Introducing the tidyverse
Exploratory data analysis

unusual <- diamonds %>%


filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual

# A tibble: 9 x 4
price x y z
<int> <dbl> <dbl> <dbl>
1 5139 0 0 0
2 6381 0 0 0
3 12800 0 0 0
4 15686 0 0 0
5 18034 0 0 0
6 2130 0 0 0
7 2130 0 0 0
8 2075 5.15 31.8 5.12
9 12210 8.09 58.9 8.06

AA 2023-24 309 / 453


Introducing the tidyverse
Exploratory data analysis

By looking at the price we may suspect that this is also a data entry
mistake.

To fix this you may either drop the rows where the erroneous data are or,
maybe more appropriately, replace them with NA.

AA 2023-24 310 / 453


Introducing the tidyverse
Exploratory data analysis

diamonds2 <- diamonds %>%


filter(between(y, 3, 20))
# or
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))

AA 2023-24 311 / 453


Introducing the tidyverse
Exploratory data analysis

It is unclear how to represent missing values in graphs. What ggplot2


does for you is to remove them and give you a warning.

AA 2023-24 312 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +


geom_point()+
geom_abline(intercept = 0, slope = 1, color="red",)+
stat_smooth(method="lm", se=FALSE)

[Plot: scatter of y against x for diamonds2, with a 45-degree reference line (red) and a linear fit (blue)]

AA 2023-24 313 / 453


Introducing the tidyverse
Exploratory data analysis

In some cases missing data may reveal more interesting information.

In our flight data, for example, missing values in departure time may
indicate that the flight was canceled.

AA 2023-24 314 / 453


Introducing the tidyverse
Exploratory data analysis

p<-nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time,)) +
geom_freqpoly(mapping = aes(colour = cancelled),
binwidth = 1/4)

AA 2023-24 315 / 453


Introducing the tidyverse
Exploratory data analysis

[Plot: frequency polygons of sched_dep_time by cancelled status, in counts]

AA 2023-24 316 / 453


Introducing the tidyverse
Exploratory data analysis

geom_freqpoly overlays histogram-style counts drawn as lines, so they do
not look the way histograms usually do.

Unfortunately, the result is not that great due to the fact that there are
more non canceled flights than canceled flights.

Comparing frequencies instead of counts may be a better idea.

AA 2023-24 317 / 453


Introducing the tidyverse
Exploratory data analysis

p<-nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time,after_stat(density))) +
geom_freqpoly(mapping = aes(colour = cancelled),
binwidth = 1/4)

AA 2023-24 318 / 453


Introducing the tidyverse
Exploratory data analysis

[Plot: frequency polygons of sched_dep_time by cancelled status, in densities]

AA 2023-24 319 / 453


Introducing the tidyverse
Exploratory data analysis

Problem
Comment on the above picture.

AA 2023-24 320 / 453


Introducing the tidyverse
Exploratory data analysis

We have seen how to explore variation within the variable.

Covariation refers to the tendency of two or more variables to vary together.

We shall distinguish among cases depending on whether we are analyzing


categorical and/or continuous variables.

AA 2023-24 321 / 453


Introducing the tidyverse
Exploratory data analysis

Suppose we are dealing with a categorical and a continuous variable.

In this case we can overlay various histograms to compare the distribution


of a continuous feature according to existing categories

AA 2023-24 322 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(data = diamonds, mapping = aes(x = price)) +


geom_freqpoly(mapping = aes(colour = cut),
binwidth = 500)

[Plot: frequency polygons of price by cut, in counts, binwidth 500]

AA 2023-24 323 / 453


Introducing the tidyverse
Exploratory data analysis

Given the heterogeneity in counts, this may not be the best solution.

Using frequencies may yield a better comparison.

AA 2023-24 324 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(data = diamonds, mapping = aes(x = price,


y = after_stat(density))) +
geom_freqpoly(mapping = aes(colour = cut),
binwidth = 500)
[Plot: frequency polygons of price by cut, in densities, binwidth 500]

AA 2023-24 325 / 453


Introducing the tidyverse
Exploratory data analysis

We see a bump in the lowest quality diamonds (fair) that suggests that
their average price may be the highest.

AA 2023-24 326 / 453


Introducing the tidyverse
Exploratory data analysis

Another way of comparing categorical and continuous variables is through


boxplots.

Boxplots feature median, the interquartile range (the box), two lines that
denote the range of the non outlier observations (the whiskers) and a
bunch of outside points that stretch beyond the whiskers (outliers).

AA 2023-24 327 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +


geom_boxplot()

[Plot: boxplots of price by cut]

AA 2023-24 328 / 453


Introducing the tidyverse
Exploratory data analysis

Problem
The above plot seems to confirm our conjecture on the mean or median
prices of low quality diamonds. As an exercise dig deeper and try to
understand what is going on.

AA 2023-24 329 / 453


Introducing the tidyverse
Exploratory data analysis

We have pictured our boxplot as a function of an ordered factor (cut). Let
us see what happens if we use an unordered factor.

AA 2023-24 330 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +


geom_boxplot()

[Plot: boxplots of hwy by class, unordered]

AA 2023-24 331 / 453


Introducing the tidyverse
Exploratory data analysis

We may reorder the plot according to a given quantity, say, the median.

AA 2023-24 332 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(data = mpg) +
geom_boxplot(mapping =
aes(x = reorder(class, hwy, FUN = median),
y = hwy))

[Plot: boxplots of hwy by class, reordered by median hwy]

AA 2023-24 333 / 453


Introducing the tidyverse
Exploratory data analysis

We explore now a couple of ways to study graphically the relationship


between categorical variables.

AA 2023-24 334 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))

[Plot: counts of cut and color combinations; point size proportional to n]

AA 2023-24 335 / 453




Introducing the tidyverse
Exploratory data analysis

The size of the point is proportional to the count for the given combination.

Another possibility is to explicitly compute the counts and feed them to a
tile plot.

AA 2023-24 337 / 453


Introducing the tidyverse
Exploratory data analysis

diamonds %>%
count(color, cut)

# A tibble: 35 x 3
color cut n
<ord> <ord> <int>
1 D Fair 163
2 D Good 662
3 D Very Good 1513
4 D Premium 1603
5 D Ideal 2834
6 E Fair 224
7 E Good 933
8 E Very Good 2400
9 E Premium 2337
10 E Ideal 3903
# i 25 more rows

AA 2023-24 338 / 453


Introducing the tidyverse
Exploratory data analysis

diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))

[Plot: tile heatmap of counts n by color and cut]

AA 2023-24 339 / 453


Introducing the tidyverse
Exploratory data analysis

We have already seen a very useful tool to plot a relationship between


continuous variables: the scatterplot.

Now we will see a couple of details and a fancy alternative.

AA 2023-24 340 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))

[Plot: scatter of price against carat]

AA 2023-24 341 / 453


Introducing the tidyverse
Exploratory data analysis

When our data set gets large, the points we put on the graph may be
overplotted by other points. A workaround for this undesirable feature is to
use transparency, i.e. the alpha aesthetic.

AA 2023-24 342 / 453


Introducing the tidyverse
Exploratory data analysis

ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price)
, alpha = 1 / 100)

[Plot: scatter of price against carat with alpha = 1/100]

AA 2023-24 343 / 453


Introducing the tidyverse
Exploratory data analysis

Another interesting approach is to use bins, as in the case of histograms.
We can plot square bins, but hexagonal bins are nicer.

AA 2023-24 344 / 453


Introducing the tidyverse
Exploratory data analysis

library(hexbin)
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))

[Plot: hexagonal-bin plot of price against carat]

AA 2023-24 345 / 453


Introducing the tidyverse
Exploratory data analysis

We will conclude here our extremely brief journey into data analysis. We
only covered a small fraction of the potential topics. For such a reason I
urge you to explore the book by Wickham & Grolemund beyond the topics
covered.

AA 2023-24 346 / 453


Applications

We will consider now various applications. We will not spend too much
time on the theoretical aspects of the methods and we will try to
understand how to make things work in practice.

The examples come from the various books mentioned in your reading list.

AA 2023-24 347 / 453


Applications
Monte Carlo simulations

We open our applications section by considering a very important tool, i.e.


Monte Carlo simulation.

Monte Carlo simulation is used, among other things, to study the


behaviour of estimators and test statistics in finite samples.

AA 2023-24 348 / 453


Applications
Monte Carlo simulations

Suppose we are working with a certain estimation method (OLS, IV, GMM,
or something else).

Of this method we know the behaviour when n → ∞.

Hence, our empirical analysis would rely on the fact that our asymptotic
distributions are sufficiently accurate to describe the finite sample
distributions that are, by the way, unknown.

AA 2023-24 349 / 453


Applications
Monte Carlo simulations

To verify the quality of our asymptotic results we may specify a simulation


experiment where we completely describe a DGP (i.e. we know everything
about how the data are produced).

That is, we sample a large number of data sets from a given distribution.

Then, we compute some quantities for which we have a theoretically


justified expectation about their behaviour.

For example, if the estimator is consistent we expect it to be close to its
true value; if it is asymptotically normal we expect its histogram to be close
to a normal density; if we run a t test at 5% nominal size we expect to
reject 5% of the time.

AA 2023-24 350 / 453


Applications
Monte Carlo simulations

Let us consider an example involving a probit model.

The probit model is used (along with other binary choice models) when the
outcome variable takes only two values.

AA 2023-24 351 / 453


Applications
Monte Carlo simulations

Consider

yi∗ = xi′ θ + εi

where yi∗ is a continuous but unobserved variable. Assume

yi = 1(yi∗ > 0) = 1(xi′ θ + εi > 0)

where yi is observed.

AA 2023-24 352 / 453


Applications
Monte Carlo simulations

Moreover,

E[1(yi∗ > 0)|xi] = E[1(xi′θ + εi > 0)|xi] = P(εi > −xi′θ) = Φ(xi′θ).

It is well known that the loglikelihood function for the probit model is

ℓ = ∑_{i=1}^n [ yi log Φ(xi′θ) + (1 − yi) log(1 − Φ(xi′θ)) ]

AA 2023-24 353 / 453


Applications
Monte Carlo simulations

It is possible to show that

√n(θ̂ − θ) →d N(0, Ω⁻¹)

where

Ω = E[ (ϕ(xi′θ))² / (Φ(xi′θ)(1 − Φ(xi′θ))) · xi xi′ ]

AA 2023-24 354 / 453


Applications
Monte Carlo simulations

To build our Monte Carlo experiment we have to completely specify our
data generating process. That is, we have to specify which distribution
produces the data and the value of the true parameter.

Specifically, we have a bunch of random variables yi∗, xi and εi.

We do not need to choose yi∗ because we know that it is the result of a
combination of xi and εi. We must, though, specify θ.

Notice that i = 1, . . . , n and that xi is, say, a k-vector. Once everything is
specified we can calculate the statistics of interest and replicate R times,
where R is a big number.

AA 2023-24 355 / 453


Applications
Monte Carlo simulations

Let us write a small algorithm


1. Sample xi and εi and construct yi .
2. Compute a statistic of interest.
3. Repeat 1. and 2. R times.
Let us put some flesh into this algorithm.

AA 2023-24 356 / 453


Applications
Monte Carlo simulations

First define the likelihood function.

l.probit<-function(theta,y,x){
Phi<-pnorm(x%*%theta)
logl<- sum(y*log(Phi))+sum((1-y)*log(1-Phi))
return(-logl) # return the negative loglikelihood: optim minimizes
}

AA 2023-24 357 / 453


Applications
Monte Carlo simulations

set.seed(12345) # set random seed for replicability

R<-1000 # number of Monte Carlo repetitions

n<-100 # sample size


theta0<- c(0.1,0.1,0.1) # true parameter vector
theta.hat<-matrix(NA,R,length(theta0))
se<-matrix(NA,R,length(theta0))
t.test<-matrix(NA,R,length(theta0))

AA 2023-24 358 / 453


Applications
Monte Carlo simulations

for (i in 1:R) {
x<-cbind(1,rnorm(n,1,1),rnorm(n,1,1))
eps<- rnorm(n)
y.star<- x%*%theta0+eps
y<-1*(y.star>0)
res<-optim(theta0,l.probit,x=x,y=y,
hessian=TRUE, method = "L-BFGS-B")
theta.hat[i,]<-res$par
Omega.hat<-solve(res$hessian) # inverse Hessian of -logl estimates Var(theta.hat)
se[i,]<-sqrt(diag(Omega.hat))
t.test[i,]<- (theta.hat[i,]-theta0)/se[i,]
}
#glm1<-glm(y ~ -1+x, family = binomial(link="probit"))

AA 2023-24 359 / 453


Applications
Monte Carlo simulations

df <- data.frame(t.test = t.test[,2])


p<-ggplot(df, aes(x = t.test)) +
geom_histogram(
breaks = seq(min(t.test[,2]), max(t.test[,2]),
by = .80),
colour = "pink",
fill = "white"
)

AA 2023-24 360 / 453


Applications
Monte Carlo simulations

[Plot: histogram of the Monte Carlo t statistics; roughly standard normal]

#hist(t[,2], freq=TRUE)

AA 2023-24 361 / 453


Applications
Monte Carlo simulations

In a previous example we have introduced the size of a test, which is the


probability of rejecting the null when the null is actually true (type I error
or α or false positive):

P(reject H0 |H0 true) = α

The researcher chooses α but we want our test not to exceed such value,
so to speak.

We generally do not know whether our test does overreject or not. We use
simulation to build insight.

AA 2023-24 362 / 453


Applications
Monte Carlo simulations

For example, what is the size of the t test we calculated via Monte Carlo
simulation in the previous slide?

Notice

P(reject H0 |H0 true) = E [1(reject H0 )|H0 true]

We know that H0 is true because we built it that way, so we can use the
analogy principle
(1/R) ∑_{r=1}^R 1(|t_r| > q).

where q is an appropriately chosen quantile.

AA 2023-24 363 / 453


Applications
Monte Carlo simulations

mean(1*(abs(t.test[,2])>1.96))

[1] 0.053

It seems that our t test has quite accurate size.

AA 2023-24 364 / 453


Applications
Monte Carlo simulations

The type II error or β or false negative is the probability of failing to reject


(i.e. of accepting) the null when it is false

P(do not reject H0 |H0 false) = β

The researcher does not choose β but we want it to be small.

AA 2023-24 365 / 453


Applications
Monte Carlo simulations

A related quantity is

P(reject H0 |H0 false) = 1 − P(do not reject H0 |H0 false) = 1 − β

This is called the power of a test. A good test is able to discriminate


among various alternatives.

Let us see an example by testing an explicitly false null, say

H0 : θ2 = 0.100001

AA 2023-24 366 / 453


Applications
Monte Carlo simulations

t.test1<-(theta.hat[,2]-0.100001)/se[,2]
mean(1*(abs(t.test1)>1.96))

[1] 0.053

The rejection rate equals the size: this alternative is probably too close to the true value for the test to detect it.

AA 2023-24 367 / 453


Applications
Monte Carlo simulations

t.test1<-(theta.hat[,2]-0.2)/se[,2]
mean(1*(abs(t.test1)>1.96))

[1] 0.115

We start seeing something.

AA 2023-24 368 / 453


Applications
Monte Carlo simulations

Problem
Try to draw a picture where on the x axis you have a set of alternatives and
on the y axis you place the rejection rates.

AA 2023-24 369 / 453


Applications
The bootstrap

We can use the simulation machinery we built in our next endeavour, which
is the bootstrap.

“The bootstrap is a method for estimating the distribution of an estimator


or test statistic by resampling one’s data or a model estimated from the
data.” (Joel Horowitz)

AA 2023-24 370 / 453


Applications
The bootstrap

To understand the underlying principle suppose we want to estimate a
population mean.

One way to think of it is to suppose that there is a population from which we
want to learn the mean, say µ = E[x].

We do not see the population but we have a sample from which we can
calculate an estimator for µ, say µ̂ = (1/n) ∑_{i=1}^n xi.

AA 2023-24 371 / 453


Applications
The bootstrap

In the bootstrap world the population is the sample and the sample is the
resample.

So, what is the resample? Let us consider a simple example.

AA 2023-24 372 / 453


Applications
The bootstrap

Let x be a (vector valued) random variable with distribution F and let
xi, i = 1, . . . , n be a sample with empirical distribution Fn. That is, for
every z

Fn(z) = (1/n) ∑_{i=1}^n 1(xi ≤ z)

Suppose we are interested in a function of the data, say, Tn

Tn = x̄/sn

where x̄ = (1/n) ∑_{i=1}^n xi and sn² = (1/(n−1)) ∑_{i=1}^n (xi − x̄)².

AA 2023-24 373 / 453


Applications
The bootstrap

To implement the bootstrap we pretend that the population distribution is


Fn . Then,
1 resample the original data with replacement to generate the bootstrap
resample xi∗ , i = 1, . . . , n
2 build Tn∗
3 repeat 1 and 2 many times
4 compute the empirical probability of the event Tn∗ ≤ τ .
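A minimal sketch of these four steps for the statistic Tn = x̄/sn; the sample, B and τ below are illustrative choices of ours, not part of the slides.

set.seed(42)
x<-rnorm(50,mean=1) # the original sample
B<-999; tau<-2
T.star<-numeric(B)
for (b in 1:B) {
x.star<-sample(x,replace=TRUE) # step 1: resample with replacement
T.star[b]<-mean(x.star)/sd(x.star) # step 2: bootstrap statistic
} # step 3: repeat B times
mean(T.star<=tau) # step 4: empirical probability of Tn* <= tau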

AA 2023-24 374 / 453


Applications
The bootstrap

Recall that the bias of an estimator is defined as

Bias = E[θ̂] − θ0 ≈ E∗[θ∗] − θ̂.

We do not know the left hand side but we do know the right hand side.

AA 2023-24 375 / 453


Applications
The bootstrap

set.seed(1234)
x<-rt(100,5);y<-rchisq(100,2);B<-499
x.bar.star<-y.bar.star<-c()
for(i in 1:B){
x.bar.star[i]<-mean(sample(x,replace=TRUE))
y.bar.star[i]<-mean(sample(y,replace=TRUE))
}

AA 2023-24 376 / 453


Applications
The bootstrap

mean(x)-0;mean(x.bar.star)-mean(x)

[1] 0.0294
[1] -0.000656

mean(y)-2;mean(y.bar.star)-mean(y)

[1] 0.476
[1] -0.0145

AA 2023-24 377 / 453


Applications
The bootstrap

Another neat application of the bootstrap is the approximation of the


distribution of estimators.

Let us go a step back: we built estimators using (hopefully) appropriate


methods. Asymptotics suggests that they behave in a certain way in the
limit. In our world they are Gaussian.

AA 2023-24 378 / 453


Applications
The bootstrap

We know that asymptotic theory provides but an approximation of the true


finite sample distribution and an approximation may be a poor one.

The bootstrap provides an alternative approximation to that resulting from


the asymptotic results that may (hopefully) work better in finite samples.

These improvements are often referred to as asymptotic refinements.

AA 2023-24 379 / 453


Applications
The bootstrap

data("Journals")
journals <- Journals[, c("subs", "price")]
journals$citeprice <- Journals$price/Journals$citations
y<-log(journals$subs)
X<-cbind(1,log(journals$citeprice))
(beta.hat<-solve(crossprod(X))%*%crossprod(X,y))

[,1]
[1,] 4.766
[2,] -0.533

e<-y-X%*%beta.hat
V.hat<-solve(crossprod(X)/length(y))*
as.numeric(crossprod(e)/length(y))
t<-sqrt(length(y))*(beta.hat[2]-0)/sqrt(V.hat[2,2])

AA 2023-24 380 / 453


Applications
The bootstrap

We want to make inference on log(citeprice). The t test is

t = √n (β̂2 − β20)/σ̂2 = √n β̂2/σ̂2

where the second equality uses the null H0 : β20 = 0. The bootstrap version is

t∗ = √n (β2∗ − β̂2)/σ2∗

AA 2023-24 381 / 453


Applications
The bootstrap

set.seed(2356)
B<-99
t.star<-c()
n<-length(y)
for (i in 1:B) {
rsmpl<-sample(1:n,replace = TRUE)
y.star<-y[rsmpl];X.star<-X[rsmpl,]
beta.star<-solve(crossprod(X.star))%*%
crossprod(X.star,y.star)
e.star<-y.star-X.star%*%beta.star
V.star<-as.numeric(crossprod(e.star)/n)*
solve(crossprod(X.star)/n)
t.star[i]<-sqrt(n)*(beta.star[2]-beta.hat[2])/
sqrt(V.star[2,2])
}
AA 2023-24 382 / 453
Applications
The bootstrap

hist(t.star)

[Plot: histogram of t.star]

AA 2023-24 383 / 453


Applications
The bootstrap

if(abs(t)>=quantile(t.star,0.975)){
decision<-"reject the null"
}else{
decision<-"cannot reject the null"
}
decision

[1] "reject the null"

AA 2023-24 384 / 453


Applications
The bootstrap

Problem
Here are some simple and not so simple questions for you
1 Can you work out a 95% confidence interval for the above example?
2 Provide a similar result to the above using the percentile method.
3 The methods we use do not necessarily apply in the presence of serial
dependence or heteroskedasticity. Provide simulation evidence in the
context of the linear regression model.

AA 2023-24 385 / 453


Applications
Causality

Estimating causal effects is one fundamental problem in applied


econometrics.

There are a number of possible methods to model causal relationships: we


will provide an approach based on the linear model and directed acyclic
graphs (DAGs).

AA 2023-24 386 / 453


Applications
Causality

It is important to bear in mind that looking at the data alone does not help
us to uncover causal relationships.

For that we need some knowledge of the process we are studying. We hope
that such knowledge can be represented with a DAG.

AA 2023-24 387 / 453


Applications
Causality

Let us look at some examples.

AA 2023-24 388 / 453


Applications
Causality

Let us generate some data

n<-1000
x1<-rnorm(n);y1<-x1+1+sqrt(3)*rnorm(n)
y2<-1+2*rnorm(n);x2<-(y2-1)/4+sqrt(3)*rnorm(n)/2
z3<-rnorm(n);y3<-z3+1+sqrt(3)*rnorm(n);x3<-z3
cor(x1,y1);cor(x2,y2);cor(x3,y3)

[1] 0.539
[1] 0.494
[1] 0.493

What do the plots of x versus y look like?

AA 2023-24 389 / 453


Applications
Causality

library("ggExtra")
p <- ggplot(as.data.frame(cbind(x1,y1)), aes(x=x1, y=y1))+
geom_point(col="red")
ggMarginal(p, type = "density")

[Plot: scatter of y1 against x1 with marginal densities]

AA 2023-24 390 / 453


Applications
Causality

library("ggExtra")
p <- ggplot(as.data.frame(cbind(x2,y2)), aes(x=x2, y=y2))+
geom_point(col="blue")
ggMarginal(p, type = "density")

[Plot: scatter of y2 against x2 with marginal densities]

AA 2023-24 391 / 453


Applications
Causality

library("ggExtra")
p <- ggplot(as.data.frame(cbind(x3,y3)), aes(x=x3, y=y3))+
geom_point(col="green")
ggMarginal(p, type = "density")

[Plot: scatter of y3 against x3 with marginal densities]

AA 2023-24 392 / 453


Applications
Causality

The data are produced in different ways yet from the scatter plots their
joint distributions are indistinguishable.

How can we infer causal relationships between x and y ?

AA 2023-24 393 / 453


Applications
Causality

Problem
Look at the DGPs for x, y and z and study their corresponding DAG
representation. You may use the library dagitty.

AA 2023-24 394 / 453


Applications
Causality

Let us try to understand how to estimate causal effects using a general


approach.

We start with the do-calculus and we study two relevant cases: IV and the
front-door criterion (you may remember the back-door criterion we saw in
Econometrics I).

AA 2023-24 395 / 453


Applications
Causality: do-calculus

The do-calculus is a set of (three) rules that can be applied to a DAG and
that help us identify causal effects.

When dealing with the do-calculus you will see the so called do-operator
denoted as do(X = x) or do(x).

The rules of do-calculus allow us to transform a causal query that includes
the do-operator into one that does not, i.e. we can actually estimate such an
object!

AA 2023-24 396 / 453


Applications
Causality: do-calculus

Here are the three do-calculus rules.


1 When a variable does not affect the outcome through any path it can
be ignored.
2 When the causal effect of a variable affects the outcome only through
directed paths interventions and observations are equivalent (i.e.
do(x) and x are equivalent).
3 When an intervention (say, do(x)) does not influence the outcome
through any path, it can be ignored.
Notice that an intervention do(x) can be seen as performing surgery
on a DAG.

AA 2023-24 397 / 453


Applications
Causality: do-calculus

Let us introduce some useful notation and see some details using equations
and DAGs.

Consider a graph G that includes, among others, a certain variable X. We
use the following notation:
Denote GX̄ as the graph G where you delete the arrows going into X,
i.e. X is an orphan.
Denote GX̲ as the graph G where you delete the arrows going out of
X, i.e. X is childless.
In the following example we consider a DAG with Y , X , W , Z . We are
interested in X → Y and we want to eliminate Z .

AA 2023-24 398 / 453


Applications
Causality: rule 1 of do-calculus

Formally, rule 1 says

$$P(y \mid z, do(x), w) = P(y \mid do(x), w) \quad \text{if} \quad (Y \perp\!\!\!\perp Z \mid W, X)_{G_{\overline{X}}}.$$

This rule says that we can ignore the observation of Z once we condition on W and X in the modified graph $G_{\overline{X}}$. We say that Y and Z are d-separated.

AA 2023-24 399 / 453


Applications
Causality: rule 1 of do-calculus

Let us see this rule graphically. First load some packages.

library(dagitty)
library(ggdag)
library(patchwork) # For combining plots

AA 2023-24 400 / 453


Applications
Causality: rule 1 of do-calculus

# The original graph G
rule1_g <- dagitty("dag {
  X -> Y <- W
  W -> X
  X -> Z <- W
}")
coordinates(rule1_g) <- list(x=c(X=1,Y=2,Z=1,W=1.5),
                             y=c(X=1,Y=1,Z=2,W=2))

# G with the arrows going into X deleted (X becomes an orphan)
rule1_g_x_over <- dagitty("dag {
  X -> Y <- W
  X -> Z <- W
}")
coordinates(rule1_g_x_over) <- list(x=c(X=1,Y=2,Z=1,W=1.5),
                                    y=c(X=1,Y=1,Z=2,W=2))

AA 2023-24 401 / 453


Applications
Causality: rule 1 of do-calculus

p.rule1_g<-ggdag(rule1_g)+theme_dag()
p.rule1_g_x_over<-ggdag(rule1_g_x_over)+theme_dag()

AA 2023-24 402 / 453


Applications
Causality: rule 1 of do-calculus

On the left we have G, on the right $G_{\overline{X}}$.

p.rule1_g | p.rule1_g_x_over

[Figure: G (left) and $G_{\overline{X}}$ (right), each with nodes X, Y, Z, W]

AA 2023-24 403 / 453


Applications
Causality: rule 1 of do-calculus

You may recall that our question is whether, in computing the causal effect X → Y, we need to somehow include Z, or whether we can simply ignore it.

We can apply rule 1 to $G_{\overline{X}}$ and notice that once we condition upon W and X, Y and Z are d-separated. Thus

P(y | z, do(x), w) = P(y | do(x), w).

In sum, rule 1 offers a method to remove "redundant" nodes, but it does not really involve the do-operator.
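
We can check the d-separation claim directly with dagitty's dseparated() on the graph built above:

dseparated(rule1_g_x_over, "Y", "Z", c("W","X")) # TRUE: Y and Z are d-separated given W and X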

AA 2023-24 404 / 453


Applications
Causality: rule 2 of do-calculus

When looking at rule 2 we recognize three components: treatment x, outcome y and a variable z that induces confounding.

From rule 2 we get a condition for treating the interventional treatment do(z) as observational, i.e. as plain and simple z. In formulae,

$$P(y \mid do(z), do(x), w) = P(y \mid z, do(x), w) \quad \text{if} \quad (Y \perp\!\!\!\perp Z \mid W, X)_{G_{\overline{X},\underline{Z}}},$$

where $G_{\overline{X},\underline{Z}}$ is the graph where we remove the arrows coming into X and the arrows coming out of Z.

AA 2023-24 405 / 453


Applications
Causality: rule 2 of do-calculus

# The original graph G
rule2_g <- dagitty("dag {
  W -> X -> Y <- W
  Z -> Y
  X -> Z <- W
}")
coordinates(rule2_g) <- list(x=c(X=1,Y=2,Z=1,W=1.5),
                             y=c(X=1,Y=1,Z=2,W=2))

# G with the arrows into X and the arrows out of Z deleted
rule2_g_modif <- dagitty("dag {
  X -> Y <- W
  X -> Z <- W
}")
coordinates(rule2_g_modif) <- list(x=c(X=1,Y=2,Z=1,W=1.5),
                                   y=c(X=1,Y=1,Z=2,W=2))

AA 2023-24 406 / 453


Applications
Causality: rule 2 of do-calculus

p.rule2_g<-ggdag(rule2_g)+theme_dag()
p.rule2_g_modif<-ggdag(rule2_g_modif)+theme_dag()

AA 2023-24 407 / 453


Applications
Causality: rule 2 of do-calculus

On the left we have G, on the right $G_{\overline{X},\underline{Z}}$.

p.rule2_g | p.rule2_g_modif

[Figure: G (left) and $G_{\overline{X},\underline{Z}}$ (right), each with nodes X, Y, Z, W]

AA 2023-24 408 / 453


Applications
Causality: rule 2 of do-calculus

Do not get confused by the fact that we have two interventions here. Typically, in applications you see only one.

We want to see whether we can treat do(z) as simple z. This is possible if, in the modified graph, Z and Y are d-separated after controlling for X and W, which is indeed the case here.
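
The same kind of quick check works here, using the modified graph built above:

dseparated(rule2_g_modif, "Y", "Z", c("W","X")) # TRUE: we may treat do(z) as z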

AA 2023-24 409 / 453


Applications
Causality: rule 3 of do-calculus

Rule 3 is the most complicated one and it tells us when we can remove a
do-modified variable. Formally,

$$P(y \mid do(z), do(x), w) = P(y \mid do(x), w) \quad \text{if} \quad (Y \perp\!\!\!\perp Z \mid W, X)_{G_{\overline{X},\overline{Z(W)}}},$$

where Z(W) reads "any node in Z that is not an ancestor of W" (we will see how it plays out in the graph). That is, we can remove do(z) if there is no association, i.e. no unblocked path, from Z to Y.

AA 2023-24 410 / 453


Applications
Causality: rule 3 of do-calculus

# The original graph G
rule3_g <- dagitty("dag {
  X <- Z <- W -> Y
  X -> Y
}")
coordinates(rule3_g) <- list(x=c(X=1,Y=2,Z=1,W=1.5),
                             y=c(X=1,Y=1,Z=2,W=2))

# G with the arrows into X and into Z(W) deleted: Z becomes isolated
rule3_g_modif <- dagitty("dag {
  Z
  X -> Y <- W
}")
coordinates(rule3_g_modif) <- list(x=c(X=1,Y=2,Z=1,W=1.5),
                                   y=c(X=1,Y=1,Z=2,W=2))

AA 2023-24 411 / 453


Applications
Causality: rule 3 of do-calculus

p.rule3_g<-ggdag(rule3_g)+theme_dag()
p.rule3_g_modif<-ggdag(rule3_g_modif)+theme_dag()

AA 2023-24 412 / 453


Applications
Causality: rule 3 of do-calculus

On the left we have G, on the right $G_{\overline{X},\overline{Z(W)}}$.

p.rule3_g | p.rule3_g_modif

[Figure: G (left) and $G_{\overline{X},\overline{Z(W)}}$ (right); in the modified graph Z is isolated]

AA 2023-24 413 / 453


Applications
Causality: rule 3 of do-calculus

Our operations have d-separated Z from all the other variables. Hence, we can remove do(z):

$$P(y \mid do(z), do(x), w) = P(y \mid do(x), w).$$
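
As before, a quick dagitty check of the claim:

dseparated(rule3_g_modif, "Y", "Z", c("W","X")) # TRUE: do(z) can be dropped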

AA 2023-24 414 / 453


Applications
Causality: rule 3 of do-calculus

Using the rules of do-calculus we can derive the formulae for the backdoor
and frontdoor adjustment.

We are not going to do that.

Since you have an idea of what the backdoor criterion is, we explore the
frontdoor criterion with an example.

AA 2023-24 415 / 453


Applications
Causality: frontdoor criterion

The above description of the rules of the do-calculus is not too intuitive.

Let us try to understand how we could (implicitly) apply it using what we know best: the linear regression model.

We are going to see how to identify the causal effect of X on Y given a mediator M and a confounding (unobserved) variable U.

AA 2023-24 416 / 453


Applications
Causality: frontdoor criterion

dag <- dagitty("dag {
  X -> M -> Y
  X <- U -> Y
}")

AA 2023-24 417 / 453


Applications
Causality: frontdoor criterion

p<-ggdag(dag)+theme_dag();p
[Figure: frontdoor DAG with X → M → Y and X ← U → Y]
AA 2023-24 418 / 453


Applications
Causality: frontdoor criterion

The above graph corresponds to the set of equations

$$\begin{aligned}
Y &= g_Y(M, U) = a_1 + a_2 M + a_3 U + V_Y \\
X &= g_X(U) = b_1 + b_2 U + V_X \\
M &= g_M(X) = c_1 + c_2 X + V_M
\end{aligned}$$

where the Vs are unobserved exogenous variables (omitted in the graph).

AA 2023-24 419 / 453


Applications
Causality: frontdoor criterion

Our problem is to estimate the total effect of X on Y, i.e. X → Y.

This effect is not identified by a simple regression of Y on X: U confounds the relationship and, since U is unobserved, we cannot block the backdoor path by conditioning on it.
AA 2023-24 420 / 453




Applications
Causality: frontdoor criterion

However, we can observe the following:
1 we can estimate the effect of X on M, since the confounding path X ← U → Y ← M is blocked by Y, a collider;
2 we can estimate the effect of M on Y after controlling for X.
We can combine these two pieces of information to obtain the total effect of X on Y.

AA 2023-24 422 / 453


Applications
Causality: frontdoor criterion

So, what is the total effect of X on Y? Notice

$$Y = a_1 + a_2 M + a_3 U + V_Y = a_1 + a_2(c_1 + c_2 X + V_M) + a_3 U + V_Y = (a_1 + a_2 c_1) + a_2 c_2 X + a_3 U + (a_2 V_M + V_Y).$$

Thus, the total effect of X on Y is $a_2 c_2$. It cannot be estimated consistently by regressing Y on X because we cannot block the backdoor path through U.

However, we can estimate $a_2$ and $c_2$ separately and then take the product.

AA 2023-24 423 / 453


Applications
Causality: frontdoor criterion

Problem (Identification and estimation via the frontdoor criterion)
Consider the following DGP

$$\begin{aligned}
Y &= g_Y(M, U) = a_1 + a_2 M + a_3 U + V_Y \\
X &= g_X(U) = b_1 + b_2 U + V_X \\
M &= g_M(X) = c_1 + c_2 X + V_M.
\end{aligned}$$

Show via a Monte Carlo experiment that estimating the model

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

does not recover the causal effect. Using the information described in the corresponding DAG, show how to consistently estimate the causal effect.
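
One possible sketch of the experiment (all coefficient values below are arbitrary choices for illustration):

set.seed(1)
n <- 1e5
u <- rnorm(n)                    # unobserved confounder U
x <- 1 + 0.8*u + rnorm(n)        # X depends on U
m <- 0.5 + 1.5*x + rnorm(n)      # M depends on X, so c2 = 1.5
y <- 2 + 2*m + 1.2*u + rnorm(n)  # Y depends on M and U, so a2 = 2
coef(lm(y ~ x))[2]               # biased for the total effect a2*c2 = 3
c2_hat <- coef(lm(m ~ x))[2]     # effect of X on M
a2_hat <- coef(lm(y ~ m + x))[2] # effect of M on Y, controlling for X
a2_hat*c2_hat                    # close to 3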

AA 2023-24 424 / 453


Applications
Causality: frontdoor criterion

To see a numerical example for the frontdoor criterion look at this page from Felix Thoemmes' website, and an example geared more towards economics here from Alex Chinco's website. The discussion of the do-calculus rules heavily draws from this post by Andrew Heiss (you may also like this chapter) and these posts from Ferenc Huszár's website.

You may notice that much of this discussion relies little on the literature produced by scholars in economics. This is because economists have started working with graphical models only recently.

AA 2023-24 425 / 453


Applications
Causality: instrumental variables

Let us consider a variable X = years of schooling, a response variable Y = income and U a set of unknown causes.

Let us denote the size of the effect of X on Y as b. The causal model, assuming linearity, is

Y = a + bX + U.

The same model can be represented graphically with a DAG.

AA 2023-24 426 / 453


Applications
Causality: instrumental variables

dag <- dagitty("dag {


X -> Y
U -> Y
}")

AA 2023-24 427 / 453


Applications
Causality: instrumental variables

library(ggdag)
dag2 <- tidy_dagitty(dag)
p<-ggdag(dag2) +
theme_dag()

AA 2023-24 428 / 453


Applications
Causality: instrumental variables
[Figure: DAG with X → Y and U → Y]
AA 2023-24 429 / 453


Applications
Causality: instrumental variables

Let us assume that U is a common cause for both X and Y .

AA 2023-24 430 / 453


Applications
Causality: instrumental variables

dag <- dagitty("dag {
  X -> Y
  U -> Y
  U -> X
}")
dag2 <- tidy_dagitty(dag)
p <- ggdag(dag2) + theme_dag()

AA 2023-24 431 / 453


Applications
Causality: instrumental variables
[Figure: DAG with X → Y, U → Y and U → X]
AA 2023-24 432 / 453


Applications
Causality: instrumental variables

In this case we cannot identify the direct causal effect X → Y as there is the open confounding path X ← U → Y.

AA 2023-24 433 / 453


Applications
Causality: instrumental variables

As an exercise devise a Monte Carlo experiment and show that the OLS
estimator for the causal effect X → Y is biased.
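
A possible sketch of such an experiment (the DGP coefficients are made-up values):

set.seed(123)
R <- 1000; n <- 500
b <- 1 # true causal effect, our choice
ols <- numeric(R)
for (r in 1:R) {
  u <- rnorm(n)
  x <- 0.8*u + rnorm(n)           # U -> X
  y <- 2 + b*x + 1.5*u + rnorm(n) # X -> Y and U -> Y
  ols[r] <- coef(lm(y ~ x))[2]
}
mean(ols) - b # clearly nonzero: OLS is biased for b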

AA 2023-24 434 / 453


Applications
Causality: instrumental variables

Suppose now that there exists another variable, say Z, such that Z → X and Z causes Y only through X. To be specific...

AA 2023-24 435 / 453


Applications
Causality: instrumental variables

dag <- dagitty("dag {
  X -> Y
  U -> Y
  U -> X
  Z -> X
}")
dag2 <- tidy_dagitty(dag)
p <- ggdag(dag2) + theme_dag()

AA 2023-24 436 / 453


Applications
Causality: instrumental variables
[Figure: DAG with X → Y, U → Y, U → X and Z → X]
AA 2023-24 437 / 453


Applications
Causality: instrumental variables

Notice that the above graph corresponds to the set of equations

Y = g(X, U) = a + bX + cU
X = h(Z, U) = d + eZ + fU.

We are interested in the effect of X on Y and hence in the coefficient b.
AA 2023-24 438 / 453


Applications
Causality: instrumental variables

If we run a regression of Y on Z we estimate the compound effect e × b, and we can separately estimate e by running a regression of X on Z.

We can then estimate b by dividing the estimated compound effect e × b by the estimate of e. Notice that this is exactly what TSLS does in the just-identified case. (Prove it!)

In the medical literature the regression of Y on Z is called intent-to-treat analysis, where Z is a randomizing device (a coin flip) and X is the actual treatment.
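
A quick sketch of this ratio-of-regressions (Wald) idea, with made-up coefficients:

set.seed(42)
n <- 10000
u <- rnorm(n); z <- rnorm(n)
x <- 0.5 + 1.2*z + 0.8*u + rnorm(n) # X = d + eZ + fU + noise
y <- 2 + x + 1.5*u + rnorm(n)       # Y = a + bX + cU + noise, with b = 1
eb_hat <- coef(lm(y ~ z))[2]        # compound effect e*b
e_hat <- coef(lm(x ~ z))[2]         # first stage e
eb_hat/e_hat                        # close to b = 1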

AA 2023-24 439 / 453


Applications
Causality: instrumental variables

The ivreg package contains Card's data on returns to schooling. Card's problem was to estimate the effect of education on income.

A simple linear model cannot capture the causal effect due to confounding (common causes of income and education).

Card's idea was to use geographic data (distance to college) as instruments to consistently estimate the causal effect of education on income.

AA 2023-24 440 / 453


Applications
Causality: instrumental variables

Problem
Draw a DAG that features the relation between education, income,
geographic variation and confounding variables. Then, using Card’s data
(SchoolingReturns) estimate the effect of education on income using IV.
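
A possible sketch for the estimation part (the specification below is a deliberately simple choice, not Card's full model):

library(ivreg)
data("SchoolingReturns", package = "ivreg")
fit <- ivreg(log(wage) ~ education | nearcollege, data = SchoolingReturns)
summary(fit)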

AA 2023-24 441 / 453


Applications
Causality: LATE

It is reasonable to believe that a policy affects different individuals in different ways.

If individuals are thought to be heterogeneous the IV approach may fail, but we may still be interested in an average treatment effect (ATE).

Card's idea, for example, was to use geographic data (distance to college) as an instrument to consistently estimate the causal effect of education on income.

AA 2023-24 442 / 453


Applications
Causality: LATE

Let X be college attendance and Z indicate whether a person lives near college. They are both binary.

Assume there are four groups of people


Compliers: they take the treatment when they are told to.
Always takers: they always take the treatment.
Never takers: they never take the treatment.
Defiers: they take the treatment when they are told not to.

AA 2023-24 443 / 453


Applications
Causality: LATE

Formally, the four groups satisfy

P(X = 1|Z = 1, C) = P(X = 0|Z = 0, C) = 1, compliers
P(X = 1|Z = 1, A) = P(X = 1|Z = 0, A) = 1, always takers
P(X = 0|Z = 1, N) = P(X = 0|Z = 0, N) = 1, never takers
P(X = 0|Z = 1, D) = P(X = 1|Z = 0, D) = 1, defiers

You can attach an economic meaning to distance to college and interpret it as a price: if the price changes, demand changes. How would you interpret the above types in light of this interpretation of distance?

AA 2023-24 444 / 453


Applications
Causality: LATE

Let T ∈ {C, A, N, D} and define

$$E[Y \mid Z=1] - E[Y \mid Z=0] = \sum_{T} \big( E[Y \mid Z=1, T] - E[Y \mid Z=0, T] \big) P(T).$$

Moreover, the expected outcome conditional on Z = 1 is

$$E[Y \mid Z=1, T] = E[Y \mid Z=1, X=1, T]\, P(X=1 \mid Z=1, T) + E[Y \mid Z=1, X=0, T]\, P(X=0 \mid Z=1, T).$$

AA 2023-24 445 / 453


Applications
Causality: LATE

It is possible to show, given certain assumptions, that the effect on the compliers group C is

$$E[Y \mid X=1, C] - E[Y \mid X=0, C] = \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{P(X=1 \mid Z=1) - P(X=1 \mid Z=0)} = \frac{\mu_{y_1} - \mu_{y_0}}{p_{11} - p_{10}}.$$

Let us see how to estimate these quantities.

AA 2023-24 446 / 453


Applications
Causality: LATE

We simply use the sample analogue for each quantity:

$$\hat{\mu}_{y_1} = \frac{\sum_{i=1}^N y_i 1(z_i = 1)}{\sum_{i=1}^N 1(z_i = 1)}, \qquad \hat{\mu}_{y_0} = \frac{\sum_{i=1}^N y_i 1(z_i = 0)}{\sum_{i=1}^N 1(z_i = 0)},$$

$$\hat{p}_{11} = \frac{\sum_{i=1}^N 1(x_i = 1, z_i = 1)}{\sum_{i=1}^N 1(z_i = 1)}, \qquad \hat{p}_{10} = \frac{\sum_{i=1}^N 1(x_i = 1, z_i = 0)}{\sum_{i=1}^N 1(z_i = 0)}.$$

AA 2023-24 447 / 453


Applications
Causality: LATE

# We generate a binary variable that
# indicates whether the subject has
# more than 12 years of schooling, i.e.
# s/he went to college
X2 <- X > 12 # college indicator
# using college proximity as an instrument
mu_y1 <- mean(y[Z==1])
mu_y0 <- mean(y[Z==0])
p_11 <- mean(X2[Z==1])
p_10 <- mean(X2[Z==0])
# LATE
((mu_y1 - mu_y0)/(p_11 - p_10))

AA 2023-24 448 / 453


Applications
Causality: LATE

Problem
Implement the above code using Card’s data. Comment on the results.

AA 2023-24 449 / 453


Applications
Causality: LATE

library(ivreg)
data("SchoolingReturns")
y <- SchoolingReturns$wage
X <- 1*(SchoolingReturns$education>12)
Z <- 1*(SchoolingReturns$nearcollege=="yes")

AA 2023-24 450 / 453


Applications
Causality: LATE

mu_y1 <- mean(y[Z==1])
mu_y0 <- mean(y[Z==0])
p_11 <- mean(X[Z==1])
p_10 <- mean(X[Z==0])
# LATE
((mu_y1 - mu_y0)/(p_11 - p_10))

[1] 731

AA 2023-24 451 / 453


Applications
Causality: LATE

What happens if we estimate the treatment effect on the compliers using TSLS?

AA 2023-24 452 / 453


Applications
Causality: LATE

# build a matrix of regressors that
# includes X and a vector of ones
X1 <- cbind(1,X)
# the same for the instruments
Z1 <- cbind(1,Z)
# the problem is just identified
# so we can use TSLS
beta.tsls <- solve(crossprod(Z1,X1))%*%crossprod(Z1,y)
beta.tsls

  [,1]
   208
X  731

AA 2023-24 453 / 453
