BIO360 Biometrics I, Fall 2007
H. Wagner, Biology UTM
R cheat sheet
Modified from: P. Dalgaard (2002). Introductory Statistics with R. Springer, New York.
1. Basics

Commands
  objects()                 List of objects in workspace
  ls()                      Same
  rm(object)                Delete object

Assignments
  <-                        Assign value to a variable
  =                         Same

Getting help
  help(fun)                 Display help file for function fun()
  args(fun)                 List arguments of function fun()

Libraries / packages
  library(pkg)              Open package (library) pkg
  library(help=pkg)         Display description of package pkg
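The workspace commands above can be tried out in a fresh session. This is only a minimal sketch; the object names height_cm and label are made up for illustration.

```r
# Two equivalent ways to assign a value (names are hypothetical)
height_cm <- 172
label = "first measurement"

# List the objects that now exist in the workspace
print(ls())

# Delete one object and check that it is gone
rm(height_cm)
print(exists("height_cm"))

# Show the arguments of an existing function, here rep()
print(args(rep))
```

help(rep) would open the full help page for the same function.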
2. Vectors and data types

Generating
  seq(-4,4,0.1)             Sequence: -4.0, -3.9, -3.8, ..., 3.9, 4.0
  2:7                       Same as seq(2,7,1)
  c(5,7,9,1:3)              Concatenation (vector): 5 7 9 1 2 3
  rep(1,5)                  1 1 1 1 1
  rep(4:6,1:3)              4 5 5 6 6 6
  gl(3,2,12)                Factor with 3 levels, repeat each level in blocks
                            of 2, up to length 12 (1 1 2 2 3 3 1 1 2 2 3 3)

Coercion
  as.numeric(x)             Convert to numeric
  as.character(x)           Convert to text string
  as.logical(x)             Convert to logical
  factor(x)                 Create factor from vector x
  unlist(x)                 Convert list, result from table() etc. to vector

3. Data frames

Accessing data
  data.frame(height,        Collect vectors height and weight into
             weight)        data frame
  dfr$var                   Select vector var in data frame dfr
  attach(dfr)               Put data frame in search path
  detach()                  - and remove it from the path

Editing
  dfr2 <- edit(dfr)         Open data frame dfr in spreadsheet, write
                            changed version into new data frame dfr2
  fix(dfr)                  Open data frame dfr in spreadsheet,
                            changes will overwrite entries in dfr

Summary
  dim(dfr)                  Number of rows and columns in data frame
                            dfr, works also for matrices and arrays
  summary(dfr)              Summary statistics for each variable in dfr

4. Input and export of data

General
  data(name)                Built-in data set
  read.table("file.txt")    Read from external ASCII file

Arguments to read.table()
  header = TRUE             First line has variable names
  row.names = 1             First column has row names
  sep = ","                 Data are separated by commas
  sep = "\t"                Data are separated by tabs
  dec = ","                 Decimal point is comma
  na.strings = "."          Missing value is dot

Variants of read.table()
  read.csv("file.csv")      Comma separated
  read.delim("file.txt")    Tab delimited text file

Export
  write.table()             See help(write.table) for details

Adding names
  names()                   Column names for data frame or list only
  dimnames()                Row and column names, also for matrix

5. Indexing / selection / sorting

Vectors
  x[1]                      First element
  x[1:5]                    Subvector containing the first five elements
  x[c(2,3,5)]               Elements nos. 2, 3, and 5
  x[y <= 30]                Selection by logical expression
  x[sex == "male"]          Selection by factor variable
  i <- c(2,3,5); x[i]       Selection by numerical variable
  k <- (y <= 30); x[k]      Selection by logical variable
  length(x)                 Returns length of vector x

Matrices, data frames
  m[4, ]                    Fourth row
  m[ ,3]                    Third column
  dfr[dfr$var <= 30, ]      Partial data frame (not for matrices)
  subset(dfr, var <= 30)    Same, often simpler (not for matrices)
  m[m[ ,3] <= 30, ]         Partial matrix (also for data frames)

Sorting
  sort(c(7,9,10,6))         Returns the sorted values: 6, 7, 9, 10
  order(c(7,9,10,6))        Returns the element number in order of
                            ascending values: 4, 1, 2, 3
  order(c(7,9,10,6),        Same, but in order of decreasing values:
        decreasing = TRUE)  3, 2, 1, 4
  rank(c(7,9,10,6))         Returns the ranks in order of ascending
                            values: 2, 3, 4, 1
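The reading, selection and sorting commands from this page can be combined as in the sketch below. The data are invented for illustration, and read.table() is given its input through the text argument instead of a file so that the example is self-contained.

```r
# Hypothetical comma-separated data; "." marks a missing value
raw <- "id,height,weight,sex
a,150,52,female
b,162,60,male
c,171,.,male
d,180,82,male
e,158,49,female"

# header, row.names, sep and na.strings as listed under read.table()
dfr <- read.table(text = raw, header = TRUE, row.names = 1,
                  sep = ",", na.strings = ".", stringsAsFactors = TRUE)

print(dim(dfr))       # number of rows and columns
print(summary(dfr))   # summary statistics for each variable

# Partial data frame: rows with height of at most 160
short  <- dfr[dfr$height <= 160, ]
short2 <- subset(dfr, height <= 160)   # same selection, often simpler

# Selection by factor variable
male_heights <- dfr$height[dfr$sex == "male"]

# Sort the whole data frame by ascending height
by_height <- dfr[order(dfr$height), ]
print(by_height)
```

The selection here uses height, which has no missing values. Selecting with a logical expression on a variable that contains NA returns NA rows with the bracket form, whereas subset() drops them.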
6. Missing values

Functions
  is.na(x)                  Logical vector. TRUE where x has NA
  complete.cases(x1,x2,...) Neither missing in x1, nor x2, nor ...

Arguments to other functions
  na.rm =                   In statistical functions: Remove
                            missing if TRUE, returns NA if FALSE
  na.last =                 In sort(): TRUE, FALSE and NA mean
                            last, first, and discard
  na.action =               In lm(), etc.: values na.fail,
                            na.omit, na.exclude
  na.print =                In summary() and print():
                            How to represent NA in output
  na.strings =              In read.table():
                            Code(s) for NA in input

7. Numerical functions

Mathematical
  log(x)                    Logarithm of x, natural logarithm
  log(x, 10)                Base 10 logarithm of x
  exp(x)                    Exponential function e^x
  sin(x)                    Sine
  cos(x)                    Cosine
  tan(x)                    Tangent
  asin(x)                   Arcsin (inverse sine)
  min(x)                    Smallest value in vector
  min(x1, x2, ...)          Minimum number over several vectors
  max(x)                    Largest value in vector
  range(x)                  Like c(min(x), max(x))
  pmin(x1, x2, ...)         Parallel (elementwise) minimum over
                            multiple equally long vectors
  length(x)                 Number of elements in vector
  sum(x)                    Sum of values in vector
  cumsum(x)                 Cumulative sum of values in vector
  sum(complete.cases(x))    Number of non-missing elements

Statistical
  mean(x)                   Average
  median(x)                 Median
  quantile(x, p)            Quantiles: median = quantile(x, 0.5)
  var(x)                    Variance
  sd(x)                     Standard deviation
  cor(x, y)                 Pearson correlation
  cor(x, y, method =        Spearman rank correlation
      "spearman")

8. Programming

Conditional execution
  if(p < 0.5)               Print "Hooray" if condition is true
    print("Hooray")

  if(p < 0.5)               If condition is true, perform all commands
    { print("Hooray")       within the curly brackets { }
      i = i + 1 }

  if(p < 0.5) {             Conditional execution with an alternative
    print("Hooray")
  } else {
    i = i + 1 }

Loop
  for(i in 1:10)            Go through loop 10 times
    { print(i) }

  i <- 1                    Same, but more complicated
  while(i <= 10)
    { print(i)
      i = i + 1 }

User-defined function
  fun <- function(a, b,     Defines a function fun that returns the
           doit = FALSE)    sum of a and b if the argument doit is
    { if(doit) {a + b}      set to TRUE, or zero, if doit is FALSE
      else 0 }

9. Operators

Arithmetic
  +                         Addition
  -                         Subtraction
  *                         Multiplication
  /                         Division
  ^                         Raise to the power of
  %/%                       Integer division: 5 %/% 3 = 1
  %%                        Remainder from integer division: 5 %% 3 = 2

Logical or relational
  ==                        Equal to
  !=                        Not equal to
  <                         Less than
  >                         Greater than
  <=                        Less than or equal to
  >=                        Greater than or equal to
  is.na(x)                  Missing?
  &                         Logical AND
  |                         Logical OR
  !                         Logical NOT
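A short sketch of how missing values, operators and the programming constructs on this page work together. The vector values are invented; fun() follows the definition given in the table, and the loop that sums the even numbers is an illustrative addition.

```r
# A vector with one missing value (hypothetical data)
x <- c(4, 8, NA, 15)

print(is.na(x))                    # TRUE only at the third position
print(mean(x))                     # NA, because na.rm defaults to FALSE
print(mean(x, na.rm = TRUE))       # mean of the non-missing values
n_ok <- sum(complete.cases(x))     # number of non-missing elements

# User-defined function as described in the table
fun <- function(a, b, doit = FALSE) {
  if (doit) {
    a + b
  } else {
    0
  }
}
print(fun(2, 3))                   # doit is FALSE
print(fun(2, 3, doit = TRUE))      # doit is TRUE

# Loop with a condition: add up the even numbers from 1 to 10
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0 & !is.na(i)) {
    total <- total + i
  }
}
print(total)
```

Note that else is written on the same line as the closing bracket of the if block. At the top level of a script or at the console, an else that starts a new line is not attached to the preceding if.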
10. Tabulation, grouping, recoding

General
  table(x)                  Frequency table of vector (factor) x
  table(x, y)               Crosstabulation of x and y
  xtabs(~ x + y)            Formula interface for crosstabulation:
                            use summary() for chi-square test
  factor(x)                 Convert vector to factor
  cut(x, breaks)            Groups from cutpoints for continuous
                            variable, breaks is a vector of cutpoints

Arguments to factor()
  levels = c()              Values of x to code. Use if some values
                            are not present in data, or if the order
                            would be wrong.
  labels = c()              Values associated with factor levels
  exclude = c()             Values to exclude. Default NA. Set to NULL
                            to have missing values included as a level.

Arguments to cut()
  breaks = c()              Cutpoints. Note that values of x outside of
                            breaks give NA. Can also be a single
                            number, the number of intervals.
  labels = c()              Names for groups. Default is interval
                            notation such as (a,b]; labels = FALSE
                            gives integer codes 1, 2, ...

Factor recoding
  levels(f) <- names        New level names
  factor(newcodes[f])       Combining levels: newcodes, e.g.,
                            c(1,1,1,2,3) to amalgamate the first 3
                            of 5 groups of factor f

11. Manipulations of matrices and lists

Matrix algebra
  m1 %*% m2                 Matrix product
  t(m)                      Matrix transpose
  m[lower.tri(m)]           Returns the values from the lower triangle
                            of matrix m as a vector
  diag(m)                   Returns the diagonal elements of matrix m
  matrix(x, dim1, dim2)     Fill the values of vector x into a new
                            matrix with dim1 rows and dim2 columns

Marginal operations etc.
  apply(m, dim, fun)        Applies the function fun to each row
                            (dim = 1) or column (dim = 2) of matrix m
  tapply(m, list(f1,        Can be used to aggregate columns or rows
         f2), fun)          within matrix m as defined by f1, f2, using
                            the function fun (e.g., mean, max)
  split(x, f)               Split vector, matrix or data frame x by
                            factor f. Different results for matrix and
                            data frame! The result is a list with one
                            object for each level of f.
  sapply(list, fun)         Applies the function fun to each object in
  sapply(split(x,f), fun)   a list, e.g. as created by the split function

12. Statistical standard methods

Parametric tests, continuous data
  t.test                    One- and two-sample t-test
  pairwise.t.test           Pairwise comparison of means
  cor.test                  Significance test for correlation coeff.
  var.test                  Comparison of two variances (F-test)
  lm(y ~ x)                 Regression analysis
  lm(y ~ f)                 One-way analysis of variance
  lm(y ~ x1 + x2 + x3)      Multiple regression
  lm(y ~ f1 * f2)           Two-way analysis of variance

Non-parametric
  wilcox.test               One- and two-sample Wilcoxon test
  kruskal.test              Kruskal-Wallis test
  friedman.test             Friedman's two-way analysis of variance
  cor.test variant          Spearman rank correlation
    method = "spearman"

Discrete response
  binom.test                Binomial test (incl. sign test)
  prop.test                 Comparison of proportions
  fisher.test               Exact test in 2 x 2 tables
  chisq.test                Chi-square test of independence
  glm(y ~ x1+x2,            Logistic regression
      binomial)
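Grouping, recoding and group-wise summaries from sections 10 and 11 can be sketched as follows. All values are invented for illustration, and newcodes follows the c(1,1,1,2,3) example given under factor recoding.

```r
# Continuous variable cut into three groups at the cutpoints 0, 10, 20, 50
x <- c(3, 12, 18, 25, 31, 47)
grp <- cut(x, breaks = c(0, 10, 20, 50))
print(table(grp))                       # frequency table of the groups

# Group means, computed in two equivalent ways
means_tapply <- tapply(x, grp, mean)
means_sapply <- sapply(split(x, grp), mean)
print(means_tapply)

# Factor recoding: amalgamate the first 3 of 5 levels
f <- factor(c("a", "b", "c", "d", "e", "a"))
newcodes <- c(1, 1, 1, 2, 3)
f2 <- factor(newcodes[f])
print(table(f2))

# Matrix product and transpose on a small 2 x 3 matrix
m <- matrix(1:6, 2, 3)
print(m %*% t(m))                       # 2 x 2 result
```

Indexing newcodes by the factor f uses the internal integer codes of f, which is why the order of levels(f) matters for the recoding.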
13. Statistical distributions

Normal distribution
  dnorm(x)                  Density function
  pnorm(x)                  Cumulative distribution function P(X<=x)
  qnorm(p)                  p-quantile, returns x in: P(X<=x) = p
  rnorm(n)                  n random normally distributed numbers

Distributions
  pnorm(x, mean, sd)        Normal
  plnorm(x, mean, sd)       Lognormal (mean and sd on the log scale)
  pt(x, df)                 Student's t distribution
  pf(x, n1, n2)             F distribution
  pchisq(x, df)             Chi-square distribution
  pbinom(x, n, p)           Binomial
  ppois(x, lambda)          Poisson
  punif(x, min, max)        Uniform
  pexp(x, rate)             Exponential
  pgamma(x, shape,          Gamma
         scale = scale)
  pbeta(x, a, b)            Beta
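The d/p/q/r naming scheme can be checked with a few calls. This is a minimal sketch; the seed and the parameter values are arbitrary choices for illustration.

```r
# Cumulative distribution function and quantile function are inverses
p <- pnorm(1.96)          # P(X <= 1.96) for the standard normal
x_back <- qnorm(p)        # returns 1.96 again (up to rounding)
print(c(p, x_back))

# Density at the mean of the standard normal
print(dnorm(0))

# Random numbers from a normal distribution with mean 10 and sd 2
set.seed(1)               # arbitrary seed, only for reproducibility
z <- rnorm(100, mean = 10, sd = 2)
print(summary(z))

# Binomial: P(X <= 1) for n = 2 trials with success probability 0.5
print(pbinom(1, 2, 0.5))

# Gamma: the third positional argument of pgamma() is rate, so scale
# should be passed by name; scale = 2 corresponds to rate = 0.5
g_scale <- pgamma(1, shape = 2, scale = 2)
g_rate  <- pgamma(1, shape = 2, rate = 0.5)
print(c(g_scale, g_rate))
```

The same prefixes apply to the other distributions in the list, for example qt(), rbinom() or dpois().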
14. Models

Model formulas
  ~                         As explained by
  +                         Additive effects
  :                         Interaction
  *                         Main effects + interaction: a*b = a+b+a:b
  -1                        Remove intercept

Linear models
  lm.out <- lm(y ~ x)       Fit model and save results as lm.out
  summary(lm.out)           Coefficients etc.
  anova(lm.out)             Analysis of variance table
  fitted(lm.out)            Fitted values
  resid(lm.out)             Residuals
  predict(lm.out, newdata)  Predictions for a new data frame

Other models
  glm(y ~ x, binomial)      Logistic regression
  glm(y ~ x, poisson)       Poisson regression
  gam(y ~ s(x))             Generalized additive model for non-linear
                            regression with smoothing. Package: gam
  tree(y ~ x1+x2+x3)        Classification (y = factor) or regression
                            (y = numeric) tree. Package: tree

Diagnostics
  rstudent(lm.out)          Studentized residuals
  dfbetas(lm.out)           Change in standardized regression
                            coefficients beta if observation removed
  dffits(lm.out)            Change in fit if observation removed

Survival analysis
  S <- Surv(time,ev)        Create survival object. Package: survival
  survfit(S ~ 1)            Kaplan-Meier estimate
  plot(survfit(S ~ 1))      Survival curve
  survdiff(S ~ g)           (Log-rank) test for equal survival curves
  coxph(S ~ x1 + x2)        Cox's proportional hazards model

Multivariate
  dist()                    Calculate Euclidean or other distances
  hclust()                  Hierarchical cluster analysis
  kmeans()                  k-means cluster analysis
  rda()                     Perform principal component analysis
                            PCA or redundancy analysis RDA.
                            Package: vegan
  cca()                     Perform (canonical) correspondence
                            analysis, CA / CCA. Package: vegan
  diversity()               Calculate diversity indices. Package: vegan

15. Graphics

Standard plots
  plot(x, y)                Scatterplot (or other type of plot if x and
                            y are not numeric vectors)
  plot(f, y)                Set of boxplots for each level of factor f
  hist()                    Histogram
  boxplot()                 Boxplot
  barplot()                 Bar diagram
  dotchart()                Dot diagram
  pie()                     Pie chart
  interaction.plot()        Interaction plot (analysis of variance)

Plotting elements (adding to a plot)
  lines()                   Lines
  abline()                  Regression line
  points()                  Points
  arrows()                  Arrows (NB: angle = 90 for error bars)
  box()                     Frame around plot
  title()                   Title (above plot)
  text()                    Text in plot
  mtext()                   Text in margin
  legend()                  List of symbols

Graphical pars.: arguments to par()
  pch                       Symbol (see below)
  mfrow, mfcol              Several plots in one (multiframe)
  xlim, ylim                Plot limits
  lty, lwd                  Line type / width (see below)
  col                       Color for lines or symbols (see below)

Point symbols (pch)
  0 to 19                   [figure: chart of the symbol drawn for each code]

Colors (col)
  1                         black
  2                         red
  3                         green
  4                         blue
  5                         light blue
  6                         purple
  7                         yellow
  8                         grey

Line types (lty)
  [figure: chart of the line drawn for each lty value]
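A linear model and a basic plot can be combined as in the sketch below. The data are simulated, so the intercept, slope, noise level and seed are assumptions made only for illustration. The call pdf(NULL) opens a graphics device that writes no file; leave it out to see the plot on screen.

```r
# Simulated data: y depends linearly on x plus random noise
set.seed(42)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)

# Fit the model and inspect it
lm.out <- lm(y ~ x)
print(summary(lm.out))          # coefficients etc.
print(anova(lm.out))            # analysis of variance table

# Predictions for a new data frame with the same variable name x
newdata <- data.frame(x = c(21, 22))
pred <- predict(lm.out, newdata)
print(pred)

# Scatterplot with regression line, title and legend
pdf(NULL)                       # device without an output file
plot(x, y, pch = 16, col = 4, xlim = c(0, 25))
abline(lm.out, lty = 2, lwd = 2)
title("Simulated data with fitted line")
legend("topleft", legend = c("data", "fit"),
       pch = c(16, NA), lty = c(NA, 2))
invisible(dev.off())
```

The diagnostics rstudent(lm.out), dfbetas(lm.out) and dffits(lm.out) can be applied to the same fitted object.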