R Tutorial

This document provides an overview of topics that will be covered in an R tutorial, including getting started with R, reading and manipulating data, linear models, summary statistics, and plotting data. The key points are: downloading R and RStudio, reading different data file types into R, storing and manipulating data frames, using packages like tidyverse and ggplot2, fitting linear models with lm(), summarizing data with functions like summary() and mean(), and creating plots and other visualizations with commands like plot(), hist(), and boxplot().


ECON 323

R Tutorial
Dr. Lucija Muehlenbachs
Reid Fortier
What We’ll Cover Today:

● Getting Started with R


● Reading data
● Types of data
● Storing and manipulating data
● Using packages
● Linear models
● Summary statistics
● Plotting data
Getting Started with R

● For those on their own computers, download R and RStudio from the RStudio Desktop page on the Posit website

● R is required; RStudio is highly recommended


● With RStudio we can write and save scripts, view data frames, and have a
much easier interface to navigate
● Common coding etiquette:
○ List yourself as the author of your script and date it
○ Provide comments in the script by prefacing a line with #
○ Name variables/data/files clearly and concisely
○ Set a working directory for your files to be read from
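The etiquette points above might look like this at the top of a script. This is only a sketch: the author name, date, and purpose are placeholders, and tempdir() stands in for your own project folder so that the example runs on any machine.

```r
# Author:  Your Name (placeholder)
# Date:    2024-01-15 (placeholder)
# Purpose: Clean and summarize gas price data (placeholder)

# Set the working directory so that relative file paths resolve from one place.
# tempdir() is used here only so the sketch runs anywhere; use your own folder instead.
project_dir <- tempdir()
setwd(project_dir)

# Confirm where R will now look for files
getwd()
```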
Reading Data

● R can read various file types, including
○ CSV files (read.csv())
○ TSV files (read_tsv() from the readr package, or read.delim() in base R)
○ Other delimited files (read.table() or read.delim())
○ Excel files (read_excel() or read_xls(); install the readxl package first)
○ Stata .dta files (read_dta() from the haven package)
○ And more (SAS, SPSS, etc.)
● Entering a read command in the R console only prints the data. To keep the data, we assign the result to a name, which stores it as a data frame
○ I.e. instead of simply running read.csv("gasdata.csv"), we can store it in a data frame by running
mydata <- read.csv("gasdata.csv")
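A minimal sketch of the difference between printing and storing. The file contents here are made up, and the file is written to a temporary folder only so that the example is self-contained; with real data you would point read.csv() at your own file.

```r
# Create a tiny example file (made-up values) so the sketch runs on its own
csv_path <- file.path(tempdir(), "gasdata.csv")
writeLines(c("date,gasprice", "2020-01-01,2.10", "2020-02-01,2.35"), csv_path)

# Running read.csv() alone only prints the data to the console
read.csv(csv_path)

# Assigning the result stores it in the environment as a data frame
mydata <- read.csv(csv_path)
str(mydata)   # shows the column names and types
```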
Types of Data

● Strings: character data stored as non-numeric text (e.g. "Hello World!", "Alberta", "100", etc.)
● Numeric: integers and floating-point (double) numbers
● Factors: data that is characterized by levels/categories. Numeric and string variables can be encoded as factors with the factor() command

In short, string variables will typically be for descriptive data. If they are needed for mathematical/empirical applications, they must be encoded as factors. Numeric/factor variables are typically for the more “raw” data that we will work with.
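The three types can be compared side by side. The values below are illustrative only.

```r
province   <- c("Alberta", "Ontario", "Alberta")  # character (string) vector
price_text <- "100"                               # a string, even though it looks like a number
price_num  <- as.numeric(price_text)              # convert before doing arithmetic

class(province)    # "character"
class(price_num)   # "numeric"

# Encode the strings as a factor with one level per distinct category
province_f <- factor(province)
levels(province_f)   # "Alberta" "Ontario"
table(province_f)    # number of observations in each level
```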
Storing Data/Variables

● R stores data in multiple forms:


○ Data frames
○ Matrices
○ Vectors
■ Can be manually inputted (e.g. vector <- c())
○ Lists
● To save a dataset as a data frame in R, use the <- operator
○ E.g. mydata <- read.csv("welldata.csv")
○ Alternatively, you can combine multiple vectors of equal length into a single data frame with data.frame() (e.g. mydata <- data.frame(vector1, vector2, vector3)). Note that cbind() on plain vectors returns a matrix rather than a data frame
● To save a column from a data frame as its own vector, use the same operator
○ E.g. price <- mydata$gasprice
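A short sketch of these storage forms, using made-up vectors. It also shows what cbind() produces when it is given plain vectors, which is worth checking with class() before treating the result as a data frame.

```r
# Two vectors of equal length (illustrative values)
well_id  <- c(1, 2, 3)
gasprice <- c(2.10, 2.35, 1.95)

# data.frame() combines the vectors into a data frame
mydata <- data.frame(well_id, gasprice)
class(mydata)      # "data.frame"

# cbind() on plain vectors gives a matrix instead
as_matrix <- cbind(well_id, gasprice)
class(as_matrix)   # a matrix, not a data frame

# Pull one column back out as its own vector
price <- mydata$gasprice
```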
Data Manipulation
● Using the $ character, we can add variables to our dataset
○ E.g. mydata$logprice <- log(mydata$price)
● To keep or remove variables from our data, we can use subset()
○ E.g. mydata <- subset(mydata, select = c(variable1, variable2)) to keep only variable1 and variable2
○ E.g. mydata <- subset(mydata, select = -c(variable1, variable2)) to drop variable1 and variable2
● Sometimes we may have missing data that we cannot impute. If desired, we can remedy this by deleting these observations
○ The tidyr package is useful for this through its drop_na() command
○ E.g. mydata <- mydata %>% drop_na(variable) drops the rows where variable is missing. If no variable is provided, the command drops every row that has a missing value in any column.
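The steps on this slide can be sketched with a small made-up data frame. To keep the example free of package dependencies, the missing-value step uses base R counterparts of drop_na() (logical indexing with is.na() and na.omit()) rather than tidyr itself.

```r
# Made-up data with some missing values
mydata <- data.frame(
  price  = c(2.10, NA, 1.95, 2.50),
  volume = c(10, 12, NA, 9),
  region = c("A", "B", "A", "B")
)

# Add a new variable with $ (the log of a missing value stays missing)
mydata$logprice <- log(mydata$price)

# Keep or drop columns with subset()
kept    <- subset(mydata, select = c(price, volume))
dropped <- subset(mydata, select = -c(region))

# Base R counterparts of drop_na():
no_na_price <- mydata[!is.na(mydata$price), ]  # similar to drop_na(price)
complete    <- na.omit(mydata)                 # similar to drop_na() with no variable
```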
Data Manipulation

● We can also use subset to create new data frames without erasing the
original from our environment:
○ mysubset <- subset(mydata, subset = criteria), where criteria is a logical expression indicating
which observations to keep and which to remove
○ Comparison operators:
■ >: strictly greater than (< for strictly less than)
■ >=: greater than or equal to (<= for less than or equal to)
■ !=: not equal to
■ ==: equal to
■ E.g. mysubset <- subset(mydata, subset = Date > "1990-01-01")
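A minimal sketch of subsetting by date, assuming the Date column is stored with class Date (the values are made up). The original data frame is left unchanged.

```r
mydata <- data.frame(
  Date  = as.Date(c("1989-06-01", "1990-01-01", "1995-03-15")),
  price = c(1.8, 2.0, 2.4)
)

# Keep only observations strictly after 1 January 1990
mysubset <- subset(mydata, subset = Date > as.Date("1990-01-01"))

nrow(mydata)     # the original still has all 3 rows
nrow(mysubset)   # only the 1995 observation remains
```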
Data Manipulation

● For your assignment, you will need to aggregate observations to a monthly level
● The command aggregate() is useful for this as it allows us to specify which variables to keep, which variable to aggregate over, and the function we want to aggregate by (taking the mean, summing, etc.)
● E.g. aggData <- aggregate(cbind(X1, X2) ~ aggVar, data = mydata, FUN = mean)
○ X1 and X2 are vectors of data
○ aggVar is the variable we are aggregating over (examples could be by well cohort or by
month)
○ FUN takes the function by which we are aggregating (mean takes the mean of the chosen
variables, sum adds the variables together, max returns the maximum value, etc.)
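As an illustration of monthly aggregation, the sketch below builds a month variable from a date column and averages two variables within each month. The data are made up, the month column is a hypothetical helper, and cbind() on the left of the formula is one way to aggregate several columns at once.

```r
mydata <- data.frame(
  Date = as.Date(c("2020-01-05", "2020-01-20", "2020-02-03", "2020-02-25")),
  X1   = c(10, 20, 30, 50),
  X2   = c(1, 3, 5, 7)
)

# Build the variable to aggregate over: year and month as text, e.g. "2020-01"
mydata$month <- format(mydata$Date, "%Y-%m")

# Mean of X1 and X2 within each month
aggData <- aggregate(cbind(X1, X2) ~ month, data = mydata, FUN = mean)
aggData
```

Swapping FUN = mean for FUN = sum or FUN = max changes how the observations within each month are combined.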
Using Packages

● Any packages you want to use in R need to be both installed on the local disk and loaded in the R session
● To install packages, go to the Packages tab in the lower right of RStudio and click Install, or run install.packages() with the package name in quotes
● To load packages, run the library() command with the package name
○ E.g. run library(tidyverse) before executing any commands from the tidyverse packages
● Some useful packages: readxl and haven (reading data), tidyverse (data cleaning), ggplot2 (data plotting), stargazer (output tables), lubridate (date formatting), etc.
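The install-once, load-every-session pattern might look like the following. The install line is commented out so the sketch does not download anything, and the requireNamespace() check is an optional safeguard rather than something the slides require.

```r
# Install once per machine (uncomment to run):
# install.packages("tidyverse")

# Check whether the package is installed without attaching it
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

# Load the package in every session before using its functions
if (has_tidyverse) {
  library(tidyverse)
} else {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") first")
}
```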
Linear Models

● The lm() command is used for basic linear modelling (i.e. y = 𝛽0 + 𝛽1X + 𝜖)
● Hint: use help(lm) to find what arguments the function lm takes (useful for
when you remember the command name, but not the necessary arguments)
● E.g. mymodel <- lm(Y ~ X1 + X2, data = mydata, subset = criterion)
○ Note that we can choose to subset the data within our linear model without having to create a
new dataframe with the subset command
● For cross-sectional and time series data, lm() should be sufficient
○ glm() is more flexible as it can fit limited dependent variable (LDV) models such as logit and probit (tobit models are not available in glm() and need a separate package such as AER)
○ plm() is useful for balanced panel datasets and requires the plm package
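A minimal sketch of lm() with and without the subset argument. The data are simulated with known coefficients (an assumption made purely for illustration) so that the estimates can be checked against the truth.

```r
set.seed(323)

# Simulated data: Y = 1 + 2*X1 - 0.5*X2 + small noise
mydata <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
mydata$Y <- 1 + 2 * mydata$X1 - 0.5 * mydata$X2 + rnorm(100, sd = 0.1)

# Fit the model on all observations
mymodel <- lm(Y ~ X1 + X2, data = mydata)
coef(mymodel)   # estimates should be close to 1, 2, and -0.5

# Fit the same model on a subset without creating a new data frame
mymodel_sub <- lm(Y ~ X1 + X2, data = mydata, subset = X1 > 0)
nobs(mymodel_sub)   # number of observations actually used
```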
Summarizing Data

● Once we run the lm() command and assign the result to a name, the model is saved in our environment. We can see the model results using the summary() command:
○ summary(mymodel) returns the coefficients, standard errors, t-statistics, and p-values for all included independent variables
● summary() is also used to report summary statistics of variables:
○ summary(X1) returns the min/max, quartiles, median, and mean of the variable X1
● Summary statistics can alternatively be called through their own function
○ mean(X1)
○ sd(X1)
○ var(X1)
○ max(X1)
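Both uses of summary() can be tried on small inputs. The vector below is made up, and the model uses the cars dataset that ships with R, chosen here only for convenience.

```r
# Summary statistics of a single variable
X1 <- c(2, 4, 4, 6, 9)
summary(X1)   # min, quartiles, median, mean, max
mean(X1)      # 5
var(X1)       # 7
sd(X1)        # square root of the variance
max(X1)       # 9

# Summary of a fitted model, using R's built-in cars data
mymodel <- lm(dist ~ speed, data = cars)
model_summary <- summary(mymodel)
model_summary                  # full regression output
coef_table <- coef(model_summary)   # coefficient table as a matrix
```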
Graphical Analysis: Plotting Data

● Base R uses the command plot() to plot two data vectors against each other
● plot(x, y, type, main, xlab, ylab, xlim, ylim)
○ x and y are the variables to be plotted on the x- and y-axes, respectively
○ type gives the type of plot to be drawn ("l" for lines, "p" for points, "b" for both, etc.)
○ main is the title of the figure
○ xlab and ylab are the x- and y-axis labels, respectively
○ xlim and ylim give the range of values for the x and y variables to be restricted to in the plot
■ E.g. xlim = c(xmin, xmax)
● x and y are the only necessary arguments to be passed, but it is a good
convention to appropriately name and label your figures
● Additional options for your figures include colouring the plot, choosing the size
of points, and choosing a specific aspect ratio (help(plot)!)
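A sketch of a labelled plot using the arguments listed above, with simulated data. The col, pch, and cex arguments are examples of the additional options for colour, point symbol, and point size.

```r
set.seed(1)

# Simulated data for illustration
x <- 1:20
y <- 2 * x + rnorm(20)

plot(x, y,
     type = "b",                 # both points and lines
     main = "Example figure",    # title
     xlab = "x values",          # x-axis label
     ylab = "y values",          # y-axis label
     xlim = c(0, 20),            # range shown on the x-axis
     ylim = c(0, 45),            # range shown on the y-axis
     col  = "blue",              # colour of points and lines
     pch  = 19,                  # solid circle as the point symbol
     cex  = 0.8)                 # point size relative to the default
```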
Graphical Analysis: Other Visualizations

● Histograms: use hist()
● Box-and-whisker plots: use boxplot()
● Bar plots: use barplot()
● Pie charts: use pie()
○ Remember to use the help() command to find arguments for each of these commands
● As noted earlier, the package ggplot2 is great for making nice looking figures,
but is a bit more complicated to learn.
● To save your figures, click on Export in the lower right of RStudio in the figure
preview window
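Each of the base R commands above can be tried on simulated data. The values and group labels below are made up for illustration.

```r
set.seed(2)

# Simulated data for illustration
values <- rnorm(200, mean = 50, sd = 10)
group  <- rep(c("A", "B"), each = 100)
counts <- c(A = 12, B = 7, C = 5)

# Histogram of a single numeric variable (hist() also returns the bin counts)
h <- hist(values, main = "Histogram", xlab = "Value")

# Box-and-whisker plot of values by group
boxplot(values ~ group, main = "Box-and-whisker plot", xlab = "Group", ylab = "Value")

# Bar plot and pie chart of category counts
barplot(counts, main = "Bar plot", ylab = "Count")
pie(counts, main = "Pie chart")
```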
Additional Topics We Can Cover if Time Permits

● Using ggplot2 for nicer looking figures
● Plotting multiple trends on top of each other
● Logging output
