RIntro PLSCS6200
D G Rossiter
Topics – Part 1
3. Interacting with R
Introduction to R
• ...
• ...
• R can import from and export to MS-Excel, plain text, fixed-width and delimited formats (e.g., CSV),
databases . . . ;
• R is a major part of the open source and reproducible research movement for
transparent and honest science.
• R console
• Code editor
– write one or more R commands, pass the commands to the console and see the text
output there
– advantage: can edit and re-run
– can save the script to reproduce the analysis
• Workspace viewer
• File manager
• History viewer
• Package manager
– install (from CRAN) and load (in your workspace) additional packages
• Project manager
– can switch between data analysis projects, each in its own directory
RStudio Screenshot
>
• You can type directly after the prompt; press Enter to submit the command to R
• Better: type a command in the code editor and click the Run button or press
Alt+Enter to pass the command to the console
• Text output (if any) will appear in the console; figures will appear in the graphics window
The S language
1. Origin; R vs. S
2. Expressions
4. Functions
Origin of S
Origin of R
• 1990–1994 Ross Ihaka, Robert Gentleman at Univ. Auckland (NZ), for own teaching
and research
• 1997 Kurt Hornik and Fritz Leisch establish CRAN (the Comprehensive R Archive
Network) at TU Vienna
Expressions
R can be used as a command-line calculator; these S expressions can then be used
anywhere in a statement.
> 2*pi/360
[1] 0.0174533
> 3 / 2^2 + 2 * pi
[1] 7.03319
[1] 13.3518
Assignment
Results of expressions can be saved as objects in the workspace.
[1] 0.0174533
Workspace objects
• Create by assignment
> ls()
[1] "heights"
character(0)
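A minimal sketch of assignment and workspace management, reusing the degrees-to-radians expression from the calculator example. The object names rad.per.deg and right.angle are illustrative assumptions, not names from the original slides.

```r
# Save the result of an expression as a workspace object
# (rad.per.deg is a hypothetical name chosen for this sketch)
rad.per.deg <- 2 * pi / 360
rad.per.deg                      # typing the name prints the stored value

# Saved objects can be used in later expressions
right.angle <- 90 * rad.per.deg  # 90 degrees expressed in radians

"rad.per.deg" %in% ls()          # ls() lists the objects in the workspace
rm(rad.per.deg)                  # rm() removes objects that are no longer needed
```

After rm, the removed name no longer appears in the output of ls(), while right.angle keeps its value.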
2. Argument list
(a) Required
(b) Optional, with defaults
(c) positional and/or named
These usually return some values, which can be complex data structures
> rnorm(20)
[1] 180.99 180.89 180.64 181.64 179.45 179.90 179.04 179.62 178.94 180.66 179.35
[12] 180.16 179.31 179.66 178.05 180.07 181.58 179.37 179.08 180.21
[1] 171.90 179.90 189.82 191.80 182.41 187.19 162.89 202.09 185.78 188.01 174.15
[12] 183.09 158.83 175.42 166.60 188.93 181.84 177.15 167.56 177.75
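The two blocks of output above look like draws centred on 180 with a small and a larger spread; the calls that produced them are not preserved, so the argument values below are assumptions. The sketch shows required, optional, positional and named arguments using rnorm.

```r
set.seed(42)                               # assumed seed, only for reproducibility

z  <- rnorm(20)                            # required argument only: mean = 0, sd = 1 by default
h1 <- rnorm(20, 180, 10)                   # positional: n, mean, sd in that order
h2 <- rnorm(n = 20, sd = 10, mean = 180)   # named arguments can be given in any order
h3 <- rnorm(20, mean = 180)                # mixed: sd keeps its default value of 1

round(h1, 2)                               # the returned value is a numeric vector
```

Naming the optional arguments is usually easier to read than relying on their position.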
1. R help
2. R manuals
3. on-line R help
> help(rnorm)
> ?rnorm
• Description
• Value returned
• Source of code
R manuals
• Access in R Studio with the Help tab or Help | R help menu item
• FAQ
on-line R help
• R task views
• StackOverflow R tags
• RSeek: http://www.rseek.org/
StackOverflow
URL: http://stackoverflow.com/questions/tagged/r: “Stack Overflow is a
question and answer site for professional and enthusiast programmers.”
You can post questions, always with small, reproducible examples – often writing those
examples will give you the solution yourself!
StackOverflow R tags
RSeek results
R Task Views
Some applications are covered in Task Views, on-line at
http://cran.r-project.org/web/views/index.html.
These are a summary by a task maintainer of the facilities in R (e.g., which packages and
functions to use) to accomplish certain tasks.
Examples:
• Multivariate Statistics
http://cran.r-project.org/web/views/Multivariate.html
These are often described in journal articles, books or technical reports, e.g.,
Baddeley, A., & Turner, R. (2004). spatstat: An R Package for Analyzing Spatial
Point Patterns. Journal of Statistical Software, 12(6). Retrieved from
http://www.jstatsoft.org/v12/i06
Diggle, P. J., & Ribeiro Jr., P. J. (2007). Model-based geostatistics. Springer. (the
geoR package)
Installing packages
2. In RStudio: “Packages” pane, “Install” button; enter the names of the packages to install
3. Also check “Install dependencies” – most packages depend on others that must also be on the
system
4. The first time, you will be prompted to pick a repository, also known as a mirror – R packages are
hosted at hundreds of sites around the world; they should all have the same packages
Loading packages
The library and require functions (almost equivalent) load a package if it’s not already
in the workspace; they will also load dependencies (assuming these are installed on your
system):
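A short sketch of where the two functions differ: library stops with an error if the package is missing, whereas require returns FALSE. The package name notInstalledPkg is hypothetical and assumed not to exist on your system.

```r
# library() stops with an error if the package is not installed;
# stats ships with R, so this call always succeeds
library(stats)

# require() returns TRUE or FALSE instead of stopping, which is
# convenient in conditional code.
# "notInstalledPkg" is a made-up name, assumed not to be installed.
ok <- suppressWarnings(
  require("notInstalledPkg", character.only = TRUE, quietly = TRUE)
)
ok                               # FALSE: the package could not be loaded

loaded <- require(datasets)      # TRUE: datasets is part of every R installation
```

In analysis scripts library is usually the safer choice, since a missing package then stops the script at once rather than causing an error later.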
• Look at the help pages for methods you do know; they often list related methods.
Example data
• Most packages also include example data, which are used to explain the packages’
functions and methods
• Once you know the dataset name, see its documentation with ? or help
• To load into the workspace, use the data function with the dataset name
Example
> data()
> ?CO2
> data(CO2)
> library(sp)
> data(package="sp")
> ?meuse
> data(meuse)
CO2 is a dataset in the
datasets package
Topics – Part 2
5. Logical expressions
7. Summarizing data
• logical; integer; double; character are all vectors with one or more elements
Examples:
Vectorized operations
S works on vectors and matrices as with scalars, with natural extensions of operators,
functions and methods.
The ten integers 1 ...10 returned by the call to the seq (sequence) method each have a
different random noise added to them; here the rnorm method also returns ten values.
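The call described above is not preserved in this text; a plausible reconstruction, offered only as a sketch, is the following. The seed is an assumption added for reproducibility.

```r
set.seed(1)                  # assumed seed, only for reproducibility

x <- seq(1, 10)              # the ten integers 1 ... 10
noisy <- x + rnorm(10)       # element-wise: each integer gets its own random noise
noisy

# Operators and functions apply to every element without an explicit loop
x * 2
sqrt(x)
x > 5                        # a logical vector, one element per element of x
```

No loop is written anywhere: the addition, multiplication, square root and comparison are each applied to all ten elements at once.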
[1] "integer"
[1] "numeric"
[1] "character"
[1] "matrix"
[1] "data.frame"
[1] "function"
Examples
> letters; letters + 3
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
chr [1:26] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
Matrices
> (cm <- c(35,14,11,1,4,11,3,0,12,9,38,4,2,5,12,2))
[1] 35 14 11 1 4 11 3 0 12 9 38 4 2 5 12 2
> dim(cm)
NULL
Initially, the vector has no dimensions; these are added with the dim function:
> dim(cm) <- c(4, 4)
> dim(cm)
[1] 4 4
Matrix arithmetic
Many S operators can work directly on matrices; there are also typical matrix functions:
• transposition: t function
• ...
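A small sketch of a few matrix operations, using the 4 x 4 matrix cm built earlier. Which functions the original slide went on to list is not preserved, so the selection here (t, diag, %*%) is an assumption.

```r
# Rebuild the 4 x 4 matrix; matrix() fills column by column, as dim() does
cm <- matrix(c(35, 14, 11, 1,  4, 11, 3, 0,  12, 9, 38, 4,  2, 5, 12, 2),
             nrow = 4)

t(cm)               # transposition: rows become columns
diag(cm)            # the diagonal elements: 35 11 38 2
cm * 2              # ordinary operators work element by element
cm %*% diag(4)      # %*% is the matrix product; diag(4) is the 4 x 4 identity
sum(diag(cm)) / sum(cm)   # proportion of the total that lies on the diagonal
```

Note the difference between * (element-wise product) and %*% (matrix product).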
Data frames
The fundamental structure for statistical analysis; a matrix with:
We illustrate with one of R’s example datasets, provided in the base datasets package:
We first display the help file, then load the data, then view the data structure (field names
and types):
> ?trees
> data(trees)
> str(trees)
> summary(trees$Volume)
[1] 70 65 63 72 81 83 66 75 80 75 79 76 76 69
[15] 75 74 85 86 71 64 78 80 74 72 77 81 82 80
[29] 80 80 87
[1] 70
> trees[1:3,] # first three cases (trees), all fields
> head(trees[,c(1,3)]) # first and third fields; `head' shows first six
Girth Volume
1 8.3 10.3
2 8.6 10.3
3 8.8 10.2
4 10.5 16.4
5 10.7 18.8
6 10.8 19.7
[1] 70
Factors
• Variables with a limited number of discrete values (categories) are called S factors.
• Internally they are stored as integers but each has a text name.
• They are handled properly by R functions and methods (they are not integers!).
Coefficients:
(Intercept) student
6.682e+00 3.022e-06
Meaningless!
Example (2/2)
Convert to a factor: the student number is just an ID; use as.factor:
> str(tests)
This is a meaningful one-way linear model, showing the difference in mean scores of
students 201113 and 700123 from student 131444 (the intercept).
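The data behind this example are not preserved here. The sketch below rebuilds the idea with a hypothetical tests data frame: only the three student numbers come from the text, and the scores are invented for illustration.

```r
# Hypothetical scores; only the three student numbers come from the slides
tests <- data.frame(
  student = rep(c(131444, 201113, 700123), each = 3),
  score   = c(6, 7, 8,   5, 6, 7,   8, 9, 10)
)

# With student as a number, lm fits a slope per unit of ID number: meaningless
coef(lm(score ~ student, data = tests))

# Convert the ID to a factor, then refit
tests$student <- as.factor(tests$student)
str(tests)
m <- lm(score ~ student, data = tests)
coef(m)   # intercept = mean score of the first level (student 131444);
          # the other coefficients are differences from that mean
```

With these invented scores the student means are 7, 6 and 9, so the coefficients are 7, -1 and 2.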
Data manipulation
One of the strengths of R is the ability to manipulate data.
This is especially useful for automatic identification of suspected errors, outlier detection,
data transformations, subsetting on a factor . . .
> sort(trees$Height)
[1] 63 64 65 66 69 70 71 72 72 74 74 75 75 75 76 76 77 78 79 80 80 80 80 80 81 81
[27] 82 83 85 86 87
> subset(trees, Height >= 80)
Another way . . .
Logical expressions can be used as subscripts:
[1] 5 6 9 17 18 22 26 27 28 29 30 31
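The twelve indices above are the positions of the trees that are at least 80 tall. A sketch of how a logical expression selects them follows; the commands on the original slide are not preserved, so this is a reconstruction, and the second condition (Girth < 12) is an arbitrary illustrative threshold.

```r
# trees is available from the datasets package, which is always attached
tall <- trees$Height >= 80     # logical vector: one TRUE/FALSE per tree
which(tall)                    # positions of the TRUE elements

trees[tall, ]                  # logical subscript: the same rows as subset(trees, Height >= 80)

# Conditions can be combined with & (and), | (or), ! (not)
trees[tall & trees$Girth < 12, "Volume"]
```

The logical vector must have one element per row; rows where it is TRUE are kept.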
Thin trees
[Figure: scatterplot of Height (vertical axis) against Girth (horizontal axis)]
Import/Export
Reference: “R Data Import/Export”, R manual installed with R; available under Help menu
But it was not able to do so for the factors ffreq, soil, lime. So, we have to convert:
• Field delimiters
• Header line(s)
• Skip lines
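A sketch of how these options map onto read.table arguments, followed by the conversion of a class code to a factor. The file contents are hypothetical and are given inline through the text argument so that the example is self-contained; for a real file, use the file argument instead.

```r
# A small hypothetical export: one comment line, a header line, four records
txt <- "exported from a hypothetical spreadsheet
id;ffreq;lead
1;1;299
2;1;277
3;2;199
4;3;116"

obs <- read.table(text = txt,
                  sep = ";",       # field delimiter
                  header = TRUE,   # the first line read holds the field names
                  skip = 1)        # skip the comment line at the top
str(obs)                           # ffreq was read as an integer, not a factor

obs$ffreq <- as.factor(obs$ffreq)  # convert the class code to a factor
levels(obs$ffreq)
```

read.csv is the same function with defaults suited to comma-separated files with a header line.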
File export
Very flexible write.table function.
There are also ways to export to spreadsheets, databases, images, GIS coverages . . .
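A minimal sketch of write.table, writing the first rows of the trees dataset as a comma-separated file and reading it back. The temporary file name is an assumption made so that the example leaves nothing behind in your working directory.

```r
out.file <- tempfile(fileext = ".csv")   # a temporary file, used only for this sketch

write.table(head(trees), file = out.file,
            sep = ",",             # field delimiter
            row.names = FALSE,     # do not write the row names as a first column
            quote = FALSE)         # do not put quotes around text fields

readLines(out.file, n = 2)         # look at the header line and the first record
back <- read.csv(out.file)         # read the file back into a data frame
```

The arguments mirror those of read.table, so a file written this way can be read back with matching settings.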
Statistical models in S
– lm (linear models)
– glm (generalised linear models)
– gstat methods such as variogram and krige
• Formula operator ~
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.124 29.273 -2.98 0.00583
Height 1.543 0.384 4.02 0.00038
So, the tree volume is modelled as a linear function of the tree height.
• Additive effects: +
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.988 8.638 -6.71 2.7e-07
Height 0.339 0.130 2.61 0.014
Girth 4.708 0.264 17.82 < 2e-16
• Interactions: *
> model <- lm(Volume ~ Height * Girth, data=trees); summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 69.3963 23.8358 2.91 0.00713
Height -1.2971 0.3098 -4.19 0.00027
Girth -5.8558 1.9213 -3.05 0.00511
Height:Girth 0.1347 0.0244 5.52 7.5e-06
In this case it’s the same as Height * Girth, because there are only two predictors.
• Nested models: /
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.23114 7.74157 -0.03 0.9764
Height -0.41218 0.12316 -3.35 0.0023
Height:Girth 0.06070 0.00266 22.79 <2e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Height 0.4047 0.0354 11.4 1.9e-12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.19193 15.02843 -2.01 0.05393
I(Height^2) 0.01038 0.00255 4.07 0.00033
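Several of the calls that produced the coefficient tables above are not preserved. The sketch below collects model formulas consistent with those tables; the exact formulas for the crossed, no-intercept and squared-term models are reconstructions and should be read as assumptions.

```r
m.add   <- lm(Volume ~ Height + Girth, data = trees)      # additive effects
m.int   <- lm(Volume ~ Height * Girth, data = trees)      # main effects plus interaction
m.cross <- lm(Volume ~ (Height + Girth)^2, data = trees)  # crossing up to second order
m.nest  <- lm(Volume ~ Height / Girth, data = trees)      # Height + Height:Girth
m.noint <- lm(Volume ~ Height - 1, data = trees)          # remove the intercept
m.sq    <- lm(Volume ~ I(Height^2), data = trees)         # I() protects arithmetic

all.equal(coef(m.int), coef(m.cross))   # the two specifications give the same model
names(coef(m.nest))
names(coef(m.sq))
```

Inside a formula the symbols +, *, /, ^ and - describe model structure, so ordinary arithmetic on a predictor has to be wrapped in I().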
Updating models
Use the update function; the previous left-hand and right-hand sides of the formula are each represented by a period (.)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.988 8.638 -6.71 2.7e-07
Height 0.339 0.130 2.61 0.014
Girth 4.708 0.264 17.82 < 2e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.124 29.273 -2.98 0.00583
Height 1.543 0.384 4.02 0.00038
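The two tables above are consistent with adding Girth to a Height-only model and then removing it again. A sketch under that assumption (the object names model2, model3 and model4 are illustrative):

```r
model  <- lm(Volume ~ Height, data = trees)

model2 <- update(model,  . ~ . + Girth)   # same response, add Girth to the predictors
model3 <- update(model2, . ~ . - Girth)   # remove Girth again
model4 <- update(model,  log(.) ~ .)      # same predictors, transformed response

formula(model2)
formula(model4)
all.equal(coef(model3), coef(model))      # back to the original model
```

Because update re-fits from the stored call, the data argument does not have to be repeated.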
[1] "lm"
List of 12
$ coefficients : Named num [1:2] -461 114
..- attr(*, "names")= chr [1:2] "(Intercept)" "log(Height)"
$ residuals : Named num [1:31] -10.928 -2.511 0.939 -8.028 -19.005 ...
..- attr(*, "names")= chr [1:31] "1" "2" "3" "4" ...
...
> summary(residuals(model))
Other important access functions: summary, fitted, coef, anova, effects, vcov.
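A short sketch of these access functions applied to a simple model of the trees data. The model formula is chosen for illustration and is not necessarily the one used on the slide.

```r
model <- lm(Volume ~ Height, data = trees)

coef(model)                   # fitted coefficients
head(fitted(model))           # fitted values, one per observation
summary(residuals(model))     # distribution of the residuals
anova(model)                  # analysis of variance table
vcov(model)                   # variance-covariance matrix of the coefficients
summary(model)$r.squared      # components of the summary are accessed by name

# Fitted values plus residuals give back the observed response
all.equal(unname(fitted(model) + residuals(model)), trees$Volume)
```

Using the access functions is preferable to reading components of the model object directly, since they work the same way across model classes.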
Factors
• These are converted to contrasts in the design matrix of linear (and other) models
Topics – Part 3
1. R base graphics
2. Scripts
3. User-defined functions
R Graphics
R has a very rich visualization environment. There are (at least) four graphics systems:
4. Grid graphics
R graphics are highly customizable; it is usual to write small scripts to get the exact
output you want.
Graphs may be displayed on screen or written directly to files for inclusion in documents.
Base graphics
• Simple to learn
[Figure: scatterplot of Petal.Width against Petal.Length]
[Figure: the same scatterplot with the vertical axis labelled “Petal width (cm)” and a legend for setosa, versicolor and virginica]
Note that plot starts a new graph; all the others add elements to the plot.
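The code for the customized plot is not preserved here. The sketch below shows the pattern with the iris data: one call to plot, followed by functions that add to it. The particular additions, the horizontal-axis label, the title and the output file name are assumptions. The graph is written to a PDF file; omit the pdf and dev.off calls to draw on screen instead.

```r
out.file <- file.path(tempdir(), "iris-petals.pdf")   # hypothetical output file
pdf(out.file)                          # write the graph directly to a file

# plot() starts a new graph ...
plot(Petal.Width ~ Petal.Length, data = iris,
     col = as.numeric(iris$Species), pch = 20,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")

# ... and these functions add elements to it
abline(lm(Petal.Width ~ Petal.Length, data = iris), lty = 2)   # regression line
grid()                                                          # background grid
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 20)
title(main = "Iris petals")

invisible(dev.off())                   # close the file
```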
Trellis graphics
An R implementation of the trellis graphics system developed at Bell Labs by Cleveland is
provided by the lattice package.
• Can produce higher-quality graphics, especially for multivariate visualisation when the
relationship between variables changes with some grouping factor; this is called
conditioning the graph on the factor
• It uses model formulae similar to the statistical formulae to specify the variables to be
plotted and their relation in the plot.
• Multiple items on one plot are specified with user-written panel functions
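A sketch of these three points with the iris data: a grouped plot, a plot conditioned on a factor, and a user-written panel function. The choice of a per-panel regression line is an illustrative assumption.

```r
library(lattice)    # a recommended package, normally installed along with R

# One panel, with the species distinguished by symbol and a key
p1 <- xyplot(Petal.Width ~ Petal.Length, data = iris,
             groups = Species, auto.key = TRUE)

# Conditioned on the factor: y ~ x | factor gives one panel per species
p2 <- xyplot(Petal.Width ~ Petal.Length | Species, data = iris,
             panel = function(x, y, ...) {
               panel.xyplot(x, y, ...)        # the points
               panel.lmline(x, y, lty = 2)    # a regression line for each panel
             })

print(p1)   # lattice functions return objects; print() draws them
print(p2)   # (at the console, printing happens automatically)
```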
[Figure: two lattice scatterplots of Petal.Width against Petal.Length. Left: a single panel with a key for the species. Right: three panels, one each for setosa, versicolor and virginica]
Note the right plot: it has been conditioned on a factor, namely the species.
Grammar of graphics
• Text: Wickham, H., 2009. ggplot2: Elegant Graphics for Data Analysis, Use R! Springer.
• But the qplot “quick plot” method can be used for many simple cases (analogous to
plot of base graphics).
[Figure: two scatterplots with log10(lead) on the vertical axis; in the right-hand plot the points are coloured by ffreq (classes 1, 2, 3)]
• two geometries are specified: (1) the points (a scatterplot); (2) a smooth line
• the x and y axes are the two named variables; the Pb content is log-transformed
• in the right-hand graph the points are coloured by a categorical variable (flood
frequency class)
• the smooth line and confidence limits are computed by locally-adjusted least squares
Grid graphics
A low-level graphics programming language by Paul Murrell. lattice is written in grid.
Allows fine control of graphic output.
http://www.stat.auckland.ac.nz/~paul/grid/grid.html
http://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html
Programming R
R is a full-featured, modern programming language. This can be accessed four ways, in
increasing level of complexity:
2. User-written scripts
3. User-defined functions
4. User-contributed packages
Control structures
S has ALGOL-like control structures:
• if ...else
• while, repeat
• break, next
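A small sketch of these control structures; the values and variable names are arbitrary illustrations.

```r
# if ... else
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# while: find the smallest power of 2 that exceeds 1000
p <- 1
while (p <= 1000) p <- p * 2
p        # 1024

# repeat with break and next: sum the odd numbers from 1 to 9
total <- 0
i <- 0
repeat {
  i <- i + 1
  if (i > 9) break          # leave the loop
  if (i %% 2 == 0) next     # skip even numbers, go to the next iteration
  total <- total + i
}
total    # 25
```

Many loops over the elements of a vector can be replaced by the vectorized operations shown earlier, which are usually shorter and faster.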
[Figure: scatterplot with y on the vertical axis; both axes run from −3 to 3]
Example
1. Enter the following in a plain text file:
# draw two independent normally-distributed samples
x <- rnorm(100, 180, 20); y <- rnorm(100, 180, 20)
# scatterplot
plot(x, y)
# correlation: should be close to 0
cor.test(x, y, conf.level=0.9)
> source('test.R')
The script can be run several times, also with different numbers of runs and sample sizes,
to compare the results.
Results
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.382500 -0.071820 -0.001889 -0.001115 0.072200 0.355600
User-defined functions
• Why?
\[ \bar{v}_h = \left( \prod_{i=1}^{n} v_i \right)^{1/n} \]
[1] "function"
[1] 37.6231
[1] 50
A better version
A function should check for valid inputs. This shows the use of the if, else if, else
control structure:
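The function itself is not preserved in this text. The following is a sketch of what such a checked function might look like, for the mean defined as the n-th root of the product of the values; the function name geo.mean, the specific checks and the choice to return NA with a warning are all assumptions.

```r
# Hypothetical checked version: the n-th root of the product of n positive values
geo.mean <- function(v) {
  if (!is.numeric(v)) {
    warning("argument is not numeric; returning NA")
    return(NA)
  } else if (any(is.na(v)) || any(v <= 0)) {
    warning("all values must be positive and not missing; returning NA")
    return(NA)
  } else {
    # equal to prod(v)^(1/length(v)), but less prone to overflow for long vectors
    return(exp(mean(log(v))))
  }
}

geo.mean(c(1, 100))      # 10
geo.mean(trees$Volume)
```

Computing the result on the log scale is a design choice: the product of many large values can overflow, whereas the mean of their logarithms cannot.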
Another example
The “correlation of two random normal vectors” script can be converted to a function; the
arguments are the number of runs and sample size:
Try it! The second histogram will be much more erratic than the first.
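The function is not preserved in this text; the sketch below is one way to write it. The function name cor.sim, the argument names and defaults, and the two calls at the end are assumptions; the distribution parameters (mean 180, standard deviation 20) are taken from the script shown earlier.

```r
# Hypothetical function: correlations of pairs of independent normal samples
cor.sim <- function(n.runs = 100, n.size = 100) {
  r <- numeric(n.runs)                  # one correlation coefficient per run
  for (i in seq_len(n.runs)) {
    x <- rnorm(n.size, 180, 20)
    y <- rnorm(n.size, 180, 20)
    r[i] <- cor(x, y)
  }
  hist(r, xlab = "r",
       main = paste("runs:", n.runs, " sample size:", n.size))
  invisible(r)                          # return the coefficients without printing them
}

set.seed(123)                           # assumed seed, only for reproducibility
r.many <- cor.sim(n.runs = 1000, n.size = 100)   # many runs: a smooth histogram
r.few  <- cor.sim(n.runs = 10,   n.size = 100)   # few runs: an erratic histogram
summary(r.many)
```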
Modelling
Simulation
• Contributed documentation
• Textbooks
• Task views
General introductions
• Hornik, K. 2007. R FAQ: Frequently Asked Questions on R. Also updated with each
version.
What is R? Why ‘R’ ? Availability, machines, legality, documentation, mailing lists . . .
These are updated with each R release.
On-line help
• On the internet
– RSeek: http://www.rseek.org/
– RSiteSearch method
RSeek results
Textbooks using R
More and more texts are using R code to illustrate their statistical analyses.
• data manipulation
• Bayesian analysis
• time-series
• interactive graphics
List at http://www.springer.com/series/6991
http://www.css.cornell.edu/faculty/dgr2/tutorials/index.html
These include general data analysis, logistic regression, confusion matrices, co-kriging,
partitioning transects, and fitting rational functions.
R Task Views
Some applications are covered in so-called Task Views, on-line at
http://cran.r-project.org/web/views/index.html.
These are a summary by a task maintainer of the facilities in R (e.g., which packages and
functions to use) to accomplish certain tasks. Examples:
• Multivariate Statistics
http://cran.r-project.org/web/views/Multivariate.html
New and modified packages are added to CRAN daily; new versions of base R
appear 2–4 times per year
...
Topics – Part 4
2. The Tidyverse
Literate programming:
• both code and comments in the same document; code is executed and produces the
results seen in the document; no cut-and-paste
• if data changes, document changes (code is the same, results are different!)
See: Rossiter, DG 2012. Technical Note: Literate Data Analysis using the R environment
for statistical computing and the knitr package. 26-December-2012, 35 pp.
http://www.css.cornell.edu/faculty/dgr2/_static/files/R_PDF/LDA.pdf
The Tidyverse
• Defines a syntax for pipes (magrittr package), for sequences of operations without
having to define intermediate workspace objects
• Defines the tibble: “a modern re-imagining of the data frame, keeping what time has
proven to be effective, and throwing out what it has not.”
1 https://www.tidyverse.org
2 https://r4ds.had.co.nz
Pipes
Example:
library(magrittr)  # provides the pipe operator %>%
the_data <-
read.csv('/path/to/data/file.csv') %>%
subset(variable_a > x) %>%
transform(variable_c = variable_a/variable_b) %>%
head(100)
• the results of each expression are passed to the next with the pipe operator %>%.
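The example above is schematic (the file path and the threshold x are placeholders). A runnable sketch of the same sequence on the built-in trees dataset follows. It uses R's native pipe |>, available from R 4.1, as a stand-in for the magrittr operator so that no extra package is needed; the derived variable ratio is a hypothetical illustration.

```r
# Native pipe |> (R >= 4.1), used here in place of magrittr's %>%
tall <- trees |>
  subset(Height >= 80) |>                  # keep the tall trees
  transform(ratio = Volume / Girth) |>     # add a derived variable (hypothetical)
  head(3)                                  # keep the first three rows
tall

# The same steps without a pipe need nested calls or intermediate objects:
tall2 <- head(transform(subset(trees, Height >= 80), ratio = Volume / Girth), 3)
identical(tall, tall2)
```

With library(magrittr) loaded, replacing |> by %>% gives the same result for this sequence.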