R Tutorial
Dr. Lucija Muehlenbachs
Reid Fortier
What We’ll Cover Today:
● For those on their own computers, download R and RStudio here: RStudio Desktop -
Posit
In short, string variables will typically be for descriptive data. If they are needed for
mathematical/empirical applications, they must be encoded as factors.
Numeric/factor variables are typically for more “raw” data that we will work with.
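A minimal sketch of the string-to-factor encoding described above. The vectors region and income are hypothetical illustration data, not part of the tutorial's dataset.

```r
# Hypothetical data: a descriptive string variable and a numeric variable
region <- c("West", "East", "West", "North")
income <- c(52, 61, 48, 57)

# Encode the strings as a factor so R treats them as categories
region_f <- factor(region)

class(region)         # "character"
class(region_f)       # "factor"
levels(region_f)      # "East" "North" "West" (sorted alphabetically)
as.integer(region_f)  # 3 1 3 2 (the integer codes behind each level)

# A factor can be used to group a numeric variable, e.g. mean income by region
tapply(income, region_f, mean)
```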
Storing Data/Variables
● We can also use subset to create new data frames without erasing the
original from our environment:
○ mysubset <- subset(mydata, subset = criteria), where criteria is a logical expression indicating
which observations to keep and which to remove
○ Comparison operators (each returns a logical TRUE/FALSE value):
■ >: strictly greater than (< for strictly less than)
■ >=: greater than or equal to (<= for less than or equal to)
■ !=: not equal to
■ ==: equal to
■ E.g. mysubset <- subset(mydata, subset = Date > "1990-01-01")
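The subsetting steps above can be sketched as follows. The data frame mydata and its columns Date and Value are made up for illustration. The cutoff is wrapped in as.Date() so the comparison is explicitly between dates.

```r
# Hypothetical data frame with a Date column and a numeric column
mydata <- data.frame(
  Date  = as.Date(c("1985-06-30", "1990-01-01", "1995-03-15", "2001-11-02")),
  Value = c(10, 20, 30, 40)
)

# Keep only observations strictly after 1 January 1990
mysubset <- subset(mydata, subset = Date > as.Date("1990-01-01"))

# Conditions can be combined with & (and) or | (or)
recent_large <- subset(mydata, Date >= as.Date("1990-01-01") & Value != 20)

nrow(mydata)        # 4: the original data frame is unchanged
nrow(mysubset)      # 2
nrow(recent_large)  # 2
```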
Data Manipulation
● Any packages you want to use in R need to be both installed on your computer
and loaded in your R session
● To install packages, go to the Packages tab in the lower-right pane of RStudio
(default layout) and click Install, or run install.packages() with the package name in quotes
● To load packages, run the library() command with the package name
○ E.g. run library(tidyverse) before executing any commands from the tidyverse package
● Some useful packages: readxl and haven (data import), tidyverse (data cleaning), ggplot2
(data plotting), stargazer (output tables), lubridate (date formatting), etc.
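A short sketch of the install-then-load workflow from the console. The "install only if missing" pattern is one common convention, not the only one. The package name "stats" is a stand-in that ships with R, so the snippet runs without an internet connection. Swap in a package such as "tidyverse" in practice.

```r
# One-time installation (needs an internet connection), shown commented out:
# install.packages("tidyverse")

# Install only if the package is missing, then load it
pkg <- "stats"  # stand-in package that ships with R
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)

# List the packages currently attached to the session
search()
```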
Linear Models
● The lm() command is used for basic linear modelling (i.e. y = 𝛽0 + 𝛽1X + 𝜖)
● Hint: use help(lm) to find what arguments the function lm takes (useful for
when you remember the command name, but not the necessary arguments)
● E.g. mymodel <- lm(Y ~ X1 + X2, data = mydata, subset = criterion)
○ Note that we can choose to subset the data within our linear model without having to create a
new dataframe with the subset command
● For cross-sectional and time series data, lm() should be sufficient
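○ glm() is more flexible: it fits generalized linear models, including logit and probit models
for binary dependent variables (tobit models are not covered by glm() and need a separate
package, e.g. tobit() in the AER package)
○ plm() fits panel data models (balanced or unbalanced panels) and requires the plm package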
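A minimal sketch of lm() with the subset argument, plus a logit fitted with glm(). The data are simulated, so the variable names (Y, X1, X2, D) and the coefficient values used to generate them are assumptions for illustration only.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated (hypothetical) data
n <- 100
mydata <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
mydata$Y <- 1 + 2 * mydata$X1 - 0.5 * mydata$X2 + rnorm(n)
mydata$D <- rbinom(n, size = 1, prob = plogis(0.5 * mydata$X1))  # binary outcome

# Linear model estimated only on observations with X2 > 0,
# without creating a separate data frame first
mymodel <- lm(Y ~ X1 + X2, data = mydata, subset = X2 > 0)
summary(mymodel)  # coefficients, standard errors, R-squared
nobs(mymodel)     # number of observations actually used

# Logit model for the binary outcome D
mylogit <- glm(D ~ X1 + X2, data = mydata, family = binomial(link = "logit"))
coef(mylogit)
```

Replacing link = "logit" with link = "probit" gives the probit version of the same model.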
Summarizing Data
● Base R uses the command plot() to plot two data vectors against each other
● plot(x, y, type, main, xlab, ylab, xlim, ylim)
○ x and y are the variables to be plotted on the x- and y-axes, respectively
○ type gives the type of plot to be drawn ("l" for lines, "p" for points, "b" for both, etc.)
○ main is the title of the figure
○ xlab and ylab are the x- and y-axis labels, respectively
○ xlim and ylim give the range of values for the x and y variables to be restricted to in the plot
■ E.g. xlim = c(xmin, xmax)
● x and y are the only necessary arguments to be passed, but it is a good
convention to appropriately name and label your figures
● Additional options for your figures include colouring the plot, choosing the size
of points, and choosing a specific aspect ratio (help(plot)!)
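The plot() arguments listed above can be sketched in a single call. The vectors x and y are hypothetical. The col and cex arguments (colour and point size) are standard graphical parameters and are included as examples of the additional options.

```r
# Hypothetical data to plot
x <- 1:20
y <- x^2

plot(
  x, y,
  type = "b",                    # both points and lines
  main = "y as a function of x", # figure title
  xlab = "x",                    # x-axis label
  ylab = "y = x^2",              # y-axis label
  xlim = c(0, 20),               # range shown on the x-axis
  ylim = c(0, 400),              # range shown on the y-axis
  col  = "blue",                 # colour of points and lines
  cex  = 0.8                     # point size relative to the default
)
```

When run as a script outside RStudio, R writes the figure to a file (Rplots.pdf by default) instead of showing it in the Plots tab.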
Graphical Analysis: Other Visualizations