Presentation of R
An Introduction to R
John Verzani
Outline
What is R?
The many faces of R
Data
Manipulating data
Applying functions to data
Vectorization of data
Graphics
Model formulas
Inference
Significance tests
Confidence intervals
Models
Simple linear regression
Multiple linear regression
Analysis of variance models
Logistic regression models
What is R?
The structure of R
[Diagram: the structure of R. At the center is the Kernel. Around it sit the Base packages (base, grid, stats, stats4, ...), the Recommended packages (MASS, nlme, ...), and Contributed packages from CRAN (UsingR, gregmisc, ...), loaded into a session with library(). Interfaces include the CLI, Batch mode, DCOM, ...; output goes to text and graphics devices.]
The many faces of R
[Screenshots: the Windows interface and the Mac OS X interface.]
The command line:
> 2 + 2
[1] 4
Data
Data types
> somePrimes
[1] 2 3 5 7 11 13 17
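The vector above might have been created with c(), the combine function (one possibility):
> somePrimes = c(2, 3, 5, 7, 11, 13, 17)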
Matrices
Matrices, cont.
Operations: matrix multiplication with %*% (* is entry-by-entry multiplication); solve() inverts a matrix.
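The 2-by-2 matrix M used below is not defined in this excerpt; its inverse on the next lines pins it down, so it could have been entered as:
> M = matrix(c(1, 0, 1, 1), nrow = 2)   # rows (1, 1) and (0, 1)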
> M %*% M
[,1] [,2]
[1,] 1 2
[2,] 0 1
> solve(M)
[,1] [,2]
[1,] 1 -1
[2,] 0 1
Matrices, cont.
> x = 1:5
> y = c(2, 3, 1, 4, 5)
> ones = rep(1, length(x))
> X = cbind(ones, x)
> solve(t(X) %*% X, t(X) %*% y)
[,1]
ones 0.9
x 0.7
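For comparison, the built-in lm() function solves the same least-squares problem and returns the same coefficients:
> coef(lm(y ~ x))
(Intercept)           x
        0.9         0.7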
Lists
Defining lists
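The printed output below starts at component $b; a definition consistent with it and with the later lst[[1]] example might be (the component names are assumptions):
> lst = list(a = somePrimes, b = M, c = mean)
> lst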
$b
[,1] [,2]
[1,] 1 1
[2,] 0 1
$c
function (x, ...)
UseMethod("mean")
<environment: namespace:base>
Data frames
[Diagram: a data frame as a rectangular arrangement of equal-length vectors, one vector per column.]
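A data frame collects such equal-length vectors as columns. A small made-up example:
> d = data.frame(x = 1:3, y = c("a", "b", "c"))
> d
  x y
1 1 a
2 2 b
3 3 c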
Reading in data
Data can be built-in, entered at the keyboard, or read in from
external files. These may be formatted using fixed-width fields,
comma-separated values, tables, etc. For instance, this command
reads in a data set from a URL:
Reading urls
> f = "http://www.math.csi.cuny.edu/st/R/crackers.csv"
> crackers = read.csv(f)
> names(crackers)
[1] "Company" "Product"
[3] "Crackers" "Grams"
[5] "Calories" "Fat.Calories"
[7] "Fat.Grams" "Saturated.Fat.Grams"
[9] "Sodium" "Carbohydrates"
[11] "Fiber"
Manipulating data
Assignment, Extraction
Values in vectors, matrices, lists, and data frames can be accessed
by their components:
By index
> google[1:3]
[1] 100.2 132.6 196.0
By index cont.
> M[1, ]
[1] 1 1
> lst[[1]]
[1] 2 3 5 7 11 13 17
Access by name
by name
> theSimpsons[["role"]]
[1] Comic relief Parent troublemaker
[4] Goody two-shoes Cute baby
5 Levels: Comic relief Cute baby ... troublemaker
Recycling values
When making assignments in R we might have a situation where
many values are to be replaced by just one value, or by a few. R
recycles the values on the right-hand side of an assignment to fill in the size mismatch.
replace coded values with NA
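The code chunk for this is not shown; a typical idiom, with 99 standing in as a hypothetical missing-value code, is
> x[x == 99] = NA   # the single NA is recycled over every matching position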
The median
> median(fat)
[1] 3.25
Applying functions to data
Functions
generic functions
Interacting with R from the command line requires one to
remember a lot of function names, although R helps out
somewhat. In practice, many tasks may be viewed generically:
E.g., “print” the values of an object, “summarize” values of an
object, “plot” the object. Of course, different objects should yield
different representations.
R has method systems (S3, S4) for declaring a function to be generic. This
allows a different function (a “method”) to be “dispatched” based on the “class” of
the first argument.
A basic template is:
methodName( object, extraArguments)
> summary(somePrimes)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 4.00 7.00 8.29 12.00 17.00
> summary(gender)
Female Male
2 3
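A minimal sketch of S3 dispatch (the class "grades" and its print method are made up for illustration):
> print.grades = function(x, ...) cat("Grades:", unclass(x), "\n")
> g = structure(c("A", "B", "A"), class = "grades")
> g                       # auto-printing dispatches to print.grades()
Grades: A B A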
Vectorization of data
Vectorizing a simulation
It is often faster in R to vectorize the simulation above by
generating all of the random data at once, and then applying the
mean() function to the data. The natural way to store the data is
a matrix.
Simulation using a matrix
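The code chunk is not shown; a sketch of the idea, with 1000 samples of size 10 from a standard normal (numbers chosen only for illustration):
> m = matrix(rnorm(1000 * 10), nrow = 1000)   # one sample per row
> res = apply(m, 1, mean)                     # 1000 sample means at once
(rowMeans(m) is an even faster equivalent.)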
Graphics
[Figure: density estimate of fat — Density (0.00 to 0.20) plotted against fat (0 to 8).]
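The figure was presumably produced by something like a histogram with a density estimate added (an assumption about the exact call):
> hist(fat, probability = TRUE)
> lines(density(fat))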
Graphics: cont.
Quantile-Quantile plots
> qqnorm(fat)
Boxplots
[Figure: Quantile-Quantile plot — a Normal Q-Q plot of fat, Sample Quantiles (0 to 8) against Theoretical Quantiles (-2 to 2).]
Boxplots
[Figures: example boxplots — one of values roughly 20 to 50 with a few outliers, and one relating Number of Vessels (annotated 234 (65%) and 159 (44%)) and Completeness to Sampling Fraction.]
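Boxplots are made with boxplot(); given a model formula, one box is drawn per group. A sketch using the crackers data read in earlier (not the data shown in the figure):
> boxplot(Calories ~ Company, data = crackers)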
3-d graphics
[Figure: two 3-d displays with axes labelled X1, X2, X3.]
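3-d displays can be made with persp() in base graphics or cloud() from the lattice package; a sketch with made-up data matching the axis labels X1-X3:
> library(lattice)
> d = data.frame(X1 = rnorm(50), X2 = rnorm(50), X3 = rnorm(50))
> cloud(X3 ~ X1 * X2, data = d)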
Model formulas
y ~ 1       y_i  = β_0 + ε_i
y ~ x       y_i  = β_0 + β_1 x_i + ε_i
y ~ x - 1   y_i  = β_1 x_i + ε_i              (removes the intercept)
y ~ x | f   y_i  = β_0 + β_1 x_i + ε_i        (grouped by levels of f)
y ~ f       y_ij = τ_i + ε_ij
Lattice graphics
[Figures: lattice graphics examples — an xyplot of values (20 to 50) in panels by vehicle type (Van, Large, Midsize), and a dotplot of the barley data: yield (20 to 60) for each variety (Svansota, No. 462, Manchuria, No. 475, Velvet, Peatland, Glabron, No. 457, Wisconsin No. 38, Trebi) in panels by site (Grand Rapids, Duluth, University Farm, Morris, Crookston, Waseca), grouped by year (1931, 1932).]
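The barley dotplot is a standard lattice example; it was likely produced by a call along these lines:
> library(lattice)
> dotplot(variety ~ yield | site, data = barley, groups = year,
+         auto.key = TRUE)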
Inference
Significance tests
The Bumpus data set (Ramsey and Schafer) contains data from an
1898 lecture supporting evolution (some birds survived a harsh
winter storm).
two-sample t test
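The call is not shown; given the plot of humerus by factor(code) below, it was presumably something like
> t.test(humerus ~ factor(code), data = Bumpus)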
Diagnostic plot
[Figure: boxplots of humerus (about 660 to 780) by factor(code), levels 1 and 2.]
t.test() output
Diagnostic plot
[Figure: boxplots comparing affected and unaffected groups (values about 1.0 to 2.0).]
Confidence intervals
> t.test(Bumpus$humerus)
...
95 percent confidence interval:
728.2 739.6
...
Chi-square tests
Goodness of fit tests are available through chisq.test() and
others. For instance, data from Rosen and Jerdee (1974, from
R&S) on the promotion of candidates based on gender:
gender data
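The code that builds the two-way table rj is not shown; such a table can be entered directly (the counts here are placeholders, not the actual Rosen and Jerdee data):
> rj = matrix(c(18, 14, 6, 10), nrow = 2,
+             dimnames = list(gender = c("M", "F"), promoted = c("Y", "N")))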
sieveplot(rj)
Sieve diagram
[Figure: sieve diagram of gender (F, M) by promoted (Y, N).]
> chisq.test(rj)$p.value
[1] 0.05132
> source("http://www.math.csi.cuny.edu/st/R/fat.R")
> names(fat)
[1] "case" "body.fat" "body.fat.siri"
[4] "density" "age" "weight"
[7] "height" "BMI" "ffweight"
[10] "neck" "chest" "abdomen"
[13] "hip" "thigh" "knee"
[16] "ankle" "bicep" "forearm"
[19] "wrist"
Models
Simple linear regression
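The output below, and the summary() of res on a later slide, come from a fit along these lines:
> res = lm(body.fat ~ BMI, data = fat)
> res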
Call:
lm(formula = body.fat ~ BMI, data = fat)
Coefficients:
(Intercept) BMI
-20.41 1.55
[Figure: scatterplot of body.fat against BMI (20 to 50) with the least-squares line and a resistant lqs line.]
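A plot like the one above could be drawn from the fit res together with a resistant fit from lqs() in the MASS package (an assumption about the exact commands used):
> plot(body.fat ~ BMI, data = fat)
> abline(res)                                      # least-squares line
> library(MASS)
> abline(coef = coef(lqs(body.fat ~ BMI, data = fat)), lty = 2)   # lqs line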
> summary(res)
Call:
lm(formula = body.fat ~ BMI, data = fat)
Residuals:
Min 1Q Median 3Q Max
-21.4292 -3.4478 0.2113 3.8663 11.7826
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.40508 2.36723 -8.62 7.78e-16 ***
BMI 1.54671 0.09212 16.79 < 2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
> summary(residuals(res))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-2.14e+01 -3.45e+00  2.11e-01  3.52e-17  3.87e+00  1.18e+01
Residual plots
[Figures: residual plots for res — resid(res) against fitted values, a Normal Q-Q plot of the residuals (Sample Quantiles against Theoretical Quantiles), and R's standard lm diagnostics (Residuals, Standardized residuals), with observations 9, 39, and 81 flagged.]
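The diagnostic plots above come from the fitted model; a sketch of how such plots are produced (the exact calls are not shown):
> plot(fitted(res), resid(res))   # residuals against fitted values
> qqnorm(resid(res))              # normal quantile plot of the residuals
> par(mfrow = c(2, 2))
> plot(res)                       # R's standard lm diagnostics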
Multiple linear regression
Call:
lm(formula = body.fat ~ age + weight + height + chest + abdomen + hip + thigh, data = fat)
Coefficients:
(Intercept) age weight height
-33.27351 0.00986 -0.12846 -0.09557
chest abdomen hip thigh
-0.00150 0.89851 -0.17687 0.27132
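The smaller model shown below could be reached by hand or with helpers such as update() or step(); for instance (a sketch; res.full is a hypothetical name for the fit above):
> res.small = update(res.full, . ~ weight + abdomen + thigh)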
Call:
lm(formula = body.fat ~ weight + abdomen + thigh, data = fat)
Coefficients:
(Intercept) weight abdomen thigh
-48.039 -0.170 0.917 0.209
[Figure: boxplots of Months (roughly 10 to 50) by treatment.]
Analysis of variance models
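The table below is anova() output for a one-way fit of lifetime on treatment; a call along these lines (the data-frame name is a placeholder):
> res.aov = lm(lifetime ~ treatment, data = lifetimes)   # 'lifetimes' is a hypothetical name
> anova(res.aov)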
Response: lifetime
Df Sum Sq Mean Sq F value Pr(>F)
treatment 5 12734 2547 57.1 <2e-16 ***
Residuals 343 15297 45
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Logistic regression models
[Figure: binary responses plotted against Temperature (55 to 80).]
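Logistic regression models are fit with glm() and family = binomial; a sketch matching the temperature figure (the data-frame and variable names are assumptions):
> res.glm = glm(failure ~ Temperature, family = binomial, data = orings)
> summary(res.glm)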
Mixed effects
Mixed-effects models are fit using the nlme package (or its newer
replacement, the lmer() function in the lme4 package, which can also fit
logistic-regression-type models); a small sketch follows this list.
- Need to specify the fixed and random effects.
- Can optionally specify structure beyond independence for the error terms.
- The implementation is well documented in Pinheiro and Bates (2000).
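A sketch of such a fit for the orthodontic growth data used on the following slides (Orthodont ships with the nlme package; the random-effects specification here is just one possibility):
> library(nlme)
> res.lme = lme(distance ~ I(age - 11), random = ~ 1 | Subject,
+               data = Orthodont)
> summary(res.lme)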
Mixed-effects example
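The coefficients below match an ordinary least-squares fit that ignores the grouping by subject (the object appears as res.lm in a later residual plot); presumably something like
> res.lm = lm(distance ~ I(age - 11), data = Orthodont)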
Coefficients:
(Intercept) I(age - 11)
24.02 0.66
[Figure: dotplot of resid(res.lm) (about -5 to 5) by subject (F01-F11, M01-M16).]
[Figures: per-subject interval plots (subjects M04-M16) and lattice panels of Distance from pituitary to pterygomaxillary fissure (mm) against Age (yr), ages 8 to 14.]
[Figures: residuals (mm) plotted for the Male and Female groups, per-subject intervals (F01-F10, M01-M16), and lattice panels of Distance from pituitary to pterygomaxillary fissure (mm) against Age (yr), ages 8 to 14.]
helper functions
plotCircle = function(x, radius = 1, ...) {
  ## draw a circle of the given radius centered at x = c(x, y)
  t = seq(0, 2 * pi, length.out = 100)
  polygon(x[1] + radius * cos(t), x[2] + radius * sin(t), ...)
}
doPlot = function(x, r, R, ...) {
  ## draw the small circle only if it fits entirely inside the big one
  if (sqrt(sum(x^2)) + r < R) plotCircle(x, radius = r, ...)
}
Extras
Key points
Calling function
plotPizza = function(n, R = n, r = 0.2) {
  par(mai = c(0, 0, 0, 0))                    # no margins
  plot.new(); plot.window(xlim = c(-n, n), ylim = c(-n, n), asp = 1)
  plotCircle(c(0, 0), radius = R, lwd = 2)    # the big circle
  ## a grid of candidate centers over -n:n by -n:n
  x = rep(-n:n, rep(2 * n + 1, 2 * n + 1))
  y = rep(-n:n, length.out = (2 * n + 1)^2)
  apply(cbind(x, y), 1, function(x) doPlot(x, r, R, col = gray(0.5)))
}
plotPizza(5)
Extending R with add-on packages
> install.packages("Rcmdr")
This GUI (the R Commander), available for the three main platforms, allows one to
select variables and fill in function arguments with a mouse.
Called without arguments, install.packages() lets one browse the
list of available packages.
Learning more