R Programming
1 Introduction
2 R Basics
5 Finding Help
6 Control Structures
6.1.3 If Statements
6.2 Loops
7 Functions
8 Useful Utilities
9 Running R Programs
Introduction
General Overview
One of the main attractions of using the R (http://cran.at.r-project.org) environment is the ease with
which users can write their own programs and custom functions. The R programming syntax is
extremely easy to learn, even for users with no previous programming experience. Once the
basic R programming control structures are understood, users can use the R language as a
powerful environment to perform complex custom analyses of almost any type of data.
R Basics
The R & BioConductor manual provides a general introduction to the usage of the R
environment and its basic command syntax.
Finding Help
Reference list on R programming (selection)
R Programming for Bioinformatics, by Robert Gentleman
Advanced R, by Hadley Wickham
Control Structures
Conditional Executions
Comparison Operators
equal: ==
not equal: !=
Logical Operators
and: &
or: |
not: !
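As a brief illustration of the operators listed above, the following sketch applies them element-wise to a small vector. The variable names are chosen here only for illustration.

```r
x <- 1:6                    # sample data (hypothetical)
is_three <- x == 3          # equal
not_three <- x != 3         # not equal
edge <- x == 1 | x == 6     # or: TRUE where either comparison holds
inner <- !edge              # not: inverts a logical vector
both <- x != 1 & x != 6     # and: TRUE where both comparisons hold
x[edge]                     # returns 1 6
identical(inner, both)      # returns TRUE
```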
If Statements
If statements operate on length-one logical vectors.
Syntax
if(cond1==TRUE) { cmd1 } else { cmd2 }
Example
if(1==0) {
print(1)
} else {
print(2)
}
[1] 2
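Avoid inserting a newline between '}' and 'else'. At the top level R treats the statement as
complete after the closing brace, so an 'else' on a new line causes a syntax error.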
Ifelse Statements
Ifelse statements operate on vectors of variable length.
Syntax
ifelse(test, true_value, false_value)
Example
x <- 1:10 # Creates sample data
ifelse(x<5 | x>8, x, 0)
[1] 1 2 3 4 0 0 0 0 9 10
Loops
The most commonly used loop structures in R are for, while and apply loops. Less common
are repeat loops. The break statement is used to break out of loops, and next halts the
processing of the current iteration and advances the looping index.
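The effect of next and break can be seen in a short sketch. The loop below is a made-up example: it skips even numbers with next and leaves the loop with break once the index exceeds 7.

```r
odd_values <- NULL               # storage for the results
for(i in 1:10) {
    if(i %% 2 == 0) next         # skip even numbers and continue with the next i
    if(i > 7) break              # leave the loop entirely once i exceeds 7
    odd_values <- c(odd_values, i)
}
odd_values                       # returns 1 3 5 7
```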
For Loop
For loops are controlled by a looping vector. In every iteration of the loop one value in the
looping vector is assigned to a variable that can be used in the statements of the body of the loop.
Usually, the number of loop iterations is defined by the number of values stored in the looping
vector and they are processed in the same order as they are stored in the looping vector.
Syntax
for(variable in sequence) {
statements
}
Example
mydf <- iris
myve <- NULL # Creates empty storage container
for(i in seq(along=mydf[,1])) {
myve <- c(myve, mean(as.numeric(mydf[i, 1:3]))) # Note: inject approach is much faster than append with 'c'. See below for details.
}
myve
[1] 3.333333 3.100000 3.066667 3.066667 3.333333 3.666667 3.133333 3.300000
[9] 2.900000 3.166667 3.533333 3.266667 3.066667 2.800000 3.666667 3.866667
Example: condition
x <- 1:10
z <- NULL
for(i in seq(along=x)) {
if(x[i] < 5) {
z <- c(z, x[i] - 1)
} else {
z <- c(z, x[i] / x[i])
}
}
z
[1] 0 1 2 3 1 1 1 1 1 1
While Loop
Similar to the for loop, but the iterations are controlled by a conditional statement.
Syntax
while(condition) statements
Example
z <- 0
while(z < 5) {
z <- z + 2
print(z)
}
[1] 2
[1] 4
[1] 6
Syntax
apply(X, MARGIN, FUN, ARGs)
X: array, matrix or data.frame; MARGIN: 1 for rows, 2 for columns, c(1,2) for both; FUN:
one or more functions; ARGs: possible arguments for function
Example
## Example for applying predefined mean function
apply(iris[,1:3], 1, mean)
[1] 3.333333 3.100000 3.066667 3.066667 3.333333 3.666667 3.133333 3.300000
...
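The ARGs position can be used to hand extra arguments on to FUN, and FUN can also be a custom function. A minimal sketch on a small made-up matrix (the object names are assumptions for this illustration):

```r
m <- matrix(c(1, NA, 3, 4, 5, 6), nrow=2)                  # hypothetical 2 x 3 matrix with one missing value
row_means <- apply(m, 1, mean, na.rm=TRUE)                 # na.rm=TRUE is handed on to mean()
col_sums <- apply(m, 2, function(v) sum(v, na.rm=TRUE))    # custom anonymous function per column
row_means                                                  # returns 3 5
col_sums                                                   # returns 1 7 11
```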
tapply applies a function to array categories of variable lengths (ragged array). The grouping is
defined by a factor.
Syntax
tapply(vector, factor, FUN)
Example
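A minimal sketch of tapply, assuming the built-in iris data set used elsewhere in this manual: the Sepal.Length values are grouped by the Species factor and the mean is computed for each group. The object name species_means is chosen here for illustration.

```r
## Mean sepal length per species; iris$Species is the grouping factor
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_means
names(species_means)     # returns "setosa" "versicolor" "virginica"
```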
Syntax
lapply(X, FUN)
sapply(X, FUN)
Example
## Creates a sample list
mylist <- as.list(iris[1:3,1:3])
mylist
$Sepal.Length
[1] 5.1 4.9 4.7
$Sepal.Width
[1] 3.5 3.0 3.2
$Petal.Length
[1] 1.4 1.4 1.3
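lapply(mylist, sum) # Computes the sum of each list component and returns the result as a list
$Sepal.Length
[1] 14.7
$Sepal.Width
[1] 9.7
$Petal.Length
[1] 4.1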
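sapply takes the same arguments as lapply but tries to simplify the result. The following sketch recreates the sample list from above so that it runs on its own; the object name sums is an assumption made for this illustration.

```r
mylist <- as.list(iris[1:3, 1:3])   # same sample list as above
sums <- sapply(mylist, sum)         # named numeric vector instead of a list
sums
is.list(lapply(mylist, sum))        # returns TRUE
is.list(sums)                       # returns FALSE
```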
Other Loops
Repeat Loop
Syntax
repeat statements
Loop is repeated until a break is specified. This means there needs to be a second statement to
test whether or not to break from the loop.
Example
z <- 0
repeat {
z <- z + 1
print(z)
if(z > 100) break
}
(1) Speed comparison of for loops with an append versus an inject step:
myMA <- matrix(rnorm(1000000), 100000, 10, dimnames=list(1:100000, paste("C",
1:10, sep="")))
results <- NULL
system.time(for(i in seq(along=myMA[,1])) results <- c(results,
mean(myMA[i,])))
user system elapsed
39.156 6.369 45.559
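The append step above copies and grows the result vector in every iteration. The inject alternative preallocates the result and writes into existing positions. The sketch below uses a smaller matrix than myMA (10,000 rows, chosen here so that it runs quickly) and only checks that both approaches give the same result; timings depend on the machine and are not shown.

```r
set.seed(1)                                   # reproducible sample data
smallMA <- matrix(rnorm(100000), 10000, 10)   # hypothetical, smaller than myMA
results_append <- NULL
for(i in seq_len(nrow(smallMA))) {
    results_append <- c(results_append, mean(smallMA[i, ]))   # vector is copied and grown each time
}
results_inject <- numeric(nrow(smallMA))      # preallocate the full result vector
for(i in seq_len(nrow(smallMA))) {
    results_inject[i] <- mean(smallMA[i, ])   # write into an existing position
}
all.equal(results_append, results_inject)     # returns TRUE
```

Wrapping each loop in system.time() gives the timing comparison on a given machine.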
(2) Speed comparison of apply loop versus rowMeans for computing the mean for each row
in a large matrix:
system.time(myMAmean <- apply(myMA, 1, mean))
user system elapsed
1.452 0.005 1.456
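The rowMeans side of this comparison can be sketched as follows. The matrix m is a smaller, made-up stand-in for myMA; the sketch verifies that both calls return the same values and leaves the timing to system.time() on the reader's machine.

```r
set.seed(1)
m <- matrix(rnorm(100000), 10000, 10)     # hypothetical sample matrix
mean_apply <- apply(m, 1, mean)           # calls mean() once per row
mean_vectorized <- rowMeans(m)            # one vectorized call for all rows
all.equal(mean_apply, mean_vectorized)    # returns TRUE
```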
(3) Speed comparison of apply loop versus vectorized approach for computing the standard
deviation of each row:
system.time(myMAsd <- apply(myMA, 1, sd))
user system elapsed
3.707 0.014 3.721
myMAsd[1:4]
1 2 3 4
0.8505795 1.3419460 1.3768646 1.3005428
The vector-based approach in the last step is over 200 times faster than the apply loop.
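One possible way to write such a vector-based computation of the row-wise standard deviation is sketched below. It relies on rowMeans and rowSums and on R recycling the vector of row means down each column of the matrix; the matrix m is a made-up stand-in for myMA, and no timings are claimed here.

```r
set.seed(1)
m <- matrix(rnorm(100000), 10000, 10)                                   # hypothetical sample matrix
sd_apply <- apply(m, 1, sd)                                             # calls sd() once per row
sd_vectorized <- sqrt(rowSums((m - rowMeans(m))^2) / (ncol(m) - 1))     # sample sd formula on whole matrix
all.equal(sd_apply, sd_vectorized)                                      # returns TRUE
```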
(4) Example for computing the mean for any custom selection of columns without compromising
the speed performance:
## In the following the columns are named according to their selection in myList
myList <- tapply(colnames(myMA), c(1,1,1,2,2,2,3,3,4,4), list)
myMAmean <- sapply(myList, function(x) rowMeans(myMA[,x]))
colnames(myMAmean) <- sapply(myList, paste, collapse="_")
myMAmean[1:4,]
C1_C2_C3 C4_C5_C6 C7_C8 C9_C10
1 0.0676799 -0.2860392 0.09651984 -0.7898946
2 -0.6120203 -0.7185961 0.91621371 1.1778427
3 0.2960446 -0.2454476 -1.18768621 0.9019590
4 0.9733695 -0.6242547 0.95078869 -0.7245792
Functions
A very useful feature of the R environment is the possibility to expand existing functions and to
easily write custom functions. In fact, most of the R software can be viewed as a series of R
functions.
General
Functions are defined by (1) assignment with the keyword function, (2) the declaration of
arguments/variables (arg1, arg2, ...) and (3) the definition of operations
(function_body) that perform computations on the provided arguments. A function name
needs to be assigned to call the function (see below).
Naming
Function names can be almost anything. However, the usage of names of existing functions
should be avoided.
Arguments
It is often useful to provide default values for arguments (e.g.:arg1=1:10). This way they don't
need to be provided in a function call. The argument list can also be left empty (myfct <-
function() { fct_body }) when a function is expected to return always the same
value(s). The argument '...' can be used to allow one function to pass on argument settings to
another.
Function body
The actual expressions (commands/operations) are defined in the function body which should be
enclosed by braces. The individual commands are separated by semicolons or new lines
(preferred).
Calling functions
Functions are called by their name followed by parentheses containing possible argument names.
Empty parentheses after the function name will result in an error message when a function
requires certain arguments to be provided by the user. The function name alone will print the
definition of a function.
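The points on arguments, function body and calling can be put together in a small sketch. The function myfct_sketch and its body are assumptions made for this illustration; x1 is required, x2 has a default value, and '...' is passed on to sum().

```r
## Hypothetical function: x1 is required, x2 has a default, '...' is handed on to sum()
myfct_sketch <- function(x1, x2=5, ...) {
    z1 <- x1 / x1               # 1 for any non-zero x1
    z2 <- x2 * x2
    total <- sum(z1, z2, ...)   # '...' forwards extra arguments such as na.rm
    return(total)
}
myfct_sketch(2, 3)                   # positional arguments: returns 10
myfct_sketch(x1=2)                   # default x2=5 is used: returns 26
myfct_sketch(x2=3, x1=2)             # named arguments can be given in any order: returns 10
myfct_sketch(2, NA, na.rm=TRUE)      # na.rm is passed on via '...': returns 1
```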
Scope
Variables created inside a function exist only for the lifetime of a function. Thus, they are not
accessible outside of the function. To force variables in functions to exist globally, one can use
this special assignment operator: '<<-'. If a variable with the same name as a global variable is
assigned inside a function, then the global variable is masked only within the function.
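A short sketch of these scoping rules, using made-up names (counter, local_change, global_change): a normal assignment inside a function leaves the global variable untouched, whereas '<<-' modifies it.

```r
counter <- 0                  # global variable
local_change <- function() {
    counter <- 100            # creates a local variable that masks the global one
    counter
}
global_change <- function() {
    counter <<- counter + 1   # modifies the variable in the enclosing (here: global) environment
    invisible(counter)
}
local_change()                # returns 100
counter                       # still 0
global_change()
counter                       # now 1
```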
myfct(2, 5) # the argument names are not necessary, but then the order of the specified values becomes important
myfct(x1=2) # does the same as before, but the default value '5' is used in this case
Return
The evaluation flow of a function may be terminated at any stage with the return function.
This is often used in combination with conditional evaluations.
Stop
To stop the action of a function and print an error message, one can use the stop function.
Warning
To print a warning message in unexpected situations without aborting the evaluation flow of a
function, one can use the function warning("...").
myfct(x1=-2)
Error in myfct(x1 = -2) : This function did not finish, because x1 < 0
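The three mechanisms described above can be combined in one function. The following is a sketch with a hypothetical function safe_sqrt, not the myfct definition used in this manual: stop aborts the call, warning reports a problem without aborting, and return ends the evaluation early.

```r
## Hypothetical function illustrating stop, warning and return
safe_sqrt <- function(x1) {
    if(!is.numeric(x1)) stop("x1 needs to be numeric")   # aborts with an error message
    if(x1 < 0) {
        warning("x1 < 0, returning NA")                   # reports the problem but continues
        return(NA_real_)                                  # early exit from the function
    }
    sqrt(x1)                                              # value of the last expression is returned
}
safe_sqrt(16)                                             # returns 4
res <- suppressWarnings(safe_sqrt(-2))                    # NA; the warning is silenced here
msg <- tryCatch(safe_sqrt("a"), error=function(e) conditionMessage(e))
msg                                                       # returns "x1 needs to be numeric"
```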
Useful Utilities
Debugging Utilities
Several debugging utilities are available for R. The most important utilities
are: traceback(), browser(), options(error=recover), options(error=NULL)
and debug(). The Debugging in R page provides an overview of the available resources.
Regular Expressions
R's regular expression utilities work similarly to those in other languages. To learn how to use them in R,
one can consult the main help page on this topic with ?regexp. The following gives a few
basic examples.
The grep function can be used for finding patterns in strings, here letter A in
vector month.name.
month.name[grep("A", month.name)]
[1] "April" "August"
Example for using regular expressions to substitute a pattern by another one using
the sub/gsub function with a back reference. Remember: single escapes '\' need to be double
escaped '\\' in R.
gsub("(i.*a)", "xxx_\\1", "virginica", perl = TRUE)
[1] "vxxx_irginica"
Example for importing specific lines in a file with a regular expression. First, an external file
is created with the cat function. All lines of this file are then imported into a vector
with readLines, the specific elements (lines) are retrieved with the grep function, and
the resulting lines are split into vector fields with strsplit.
cat(month.name, file="zzz.txt", sep="\n")
x <- readLines("zzz.txt")
x <- x[c(grep("^J", as.character(x), perl = TRUE))]
t(as.data.frame(strsplit(x, "u")))
[,1] [,2]
c..Jan....ary.. "Jan" "ary"
c..J....ne.. "J" "ne"
c..J....ly.. "J" "ly"
Interpreting Character String as Expression
Example
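In base R a character string is usually turned into an expression with parse(text=...) and then evaluated with eval. A minimal sketch, with the strings chosen here only for illustration:

```r
mycmd <- "sqrt(16) + 1"               # command stored as a character string
myexpr <- parse(text=mycmd)           # converts the string into an expression
eval(myexpr)                          # evaluates the expression: returns 5
myvar <- "iris"                       # object name stored as a string
dim(eval(parse(text=myvar)))          # returns 150 5
dim(get(myvar))                       # get() is an alternative when the string is just an object name
```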
myresult <- NULL
for(i in myentries) {
...
res <- readLines(x)
close(x)
...
}
print(myresult)
final
                 Pep      MW
1 MKWVTFISLLFLFSSAYS 2139.11
2          MWVTFISLL 1108.60
3     MFISLLFLFSSAYS 1624.82
Running R Programs
(1) Executing an R script from the R console
source("my_script.R")
(2.1) Syntax for running R programs from the command-line. To run the script directly
as ./my_script.R, its first line needs to contain the following statement: #!/usr/bin/env Rscript
$ Rscript my_script.R # or just ./my_script.R after making the file executable with 'chmod +x my_script.R'
All commands starting with a '$' sign need to be executed from a Unix or Linux shell.
(2.2) Alternatively, one can use the following syntax to run R programs in BATCH mode from
the command-line.
$ R CMD BATCH [options] my_script.R [outfile]
The output file lists the commands from the script file and their outputs. If no outfile is specified,
the name used is that of the infile with a possible .R extension stripped and .Rout appended. To
stop all the usual R command line information from being written to the outfile, add this as first
line to the my_script.R file: options(echo=FALSE). If the command is run like this R CMD
BATCH --no-save my_script.R, then nothing will be saved in the .RData file, which
can often get very large. More on this can be found on the help pages: $ R CMD BATCH --help
or ?BATCH.
(2.3) Another alternative for running R programs as silently as possible. In R 4.0.0 and later
the --slave option is named --no-echo.
$ R --slave < my_infile > my_outfile
######################
myarg <- commandArgs()
print(iris[1:myarg[6], ])
######################
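The fragment above reads the sixth element of commandArgs(), whose position depends on how R was started. A more defensive sketch uses trailingOnly=TRUE, which returns only the arguments supplied after the script name. The helper name n_rows_from_args and its default of 6 rows are assumptions made for this illustration.

```r
## Hypothetical helper: returns the first user-supplied argument as an integer,
## or a default value when no argument is given
n_rows_from_args <- function(args=commandArgs(trailingOnly=TRUE), default=6L) {
    if(length(args) == 0) return(default)
    n <- suppressWarnings(as.integer(args[1]))     # non-numeric input becomes NA
    if(is.na(n) || n < 1) stop("first argument needs to be a positive integer")
    n
}
n_rows_from_args("10")                             # returns 10
n_rows_from_args(character(0))                     # no argument given: returns 6
print(iris[seq_len(n_rows_from_args("3")), ])      # prints the first 3 rows of iris
```

Inside a script, calling n_rows_from_args() without arguments picks up the values passed on the command line, for example with $ Rscript my_script.R 10.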
Define S4 Classes
(A) Define S4 Classes with setClass() and new()
y <- matrix(1:50, 10, 5) # Sample data set
setClass(Class="myclass",
representation=representation(a="ANY"),
prototype=prototype(a=y[1:2,]), # Defines default value (optional)
validity=function(object) { # Can be defined in a separate step using setValidity
if(!is.matrix(object@a)) { # is.matrix also works when class() returns c("matrix", "array")
return(paste("expected matrix, but obtained", class(object@a)[1]))
} else {
return(TRUE)
}
}
)
The setClass function defines classes. Its most important arguments are
Class: the name of the class
representation: the slots that the new class should have and/or other classes that
this class extends.
prototype: an object providing default data for the slots.
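Instances of a class defined this way are created with new(), and slots are accessed with '@'. The following self-contained sketch uses a hypothetical class name (mymatclass) so that it does not interfere with the definitions above; it also shows the validity function rejecting a non-matrix value.

```r
library(methods)   # usually attached by default; loaded explicitly to be safe
setClass("mymatclass",                               # hypothetical class name
    representation(a="ANY"),
    validity=function(object) {
        if(!is.matrix(object@a)) "slot 'a' needs to be a matrix" else TRUE
    })
obj <- new("mymatclass", a=matrix(1:6, 2, 3))        # creates a valid instance
dim(obj@a)                                           # slot access with '@': returns 2 3
bad <- tryCatch(new("mymatclass", a=1:6), error=function(e) "invalid")
bad                                                  # "invalid": the validity check rejected the vector
```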
(C) A more generic way of creating class instances is to define an initialization method (details
below)
setMethod("initialize", "myclass", function(.Object, a) {
.Object@a <- a/a
.Object
})
new("myclass", a = y)
An object of class "myclass"
Slot "a":
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 1 1
[2,] 1 1 1 1 1
...
(E) Inheritance: allows one to define new classes that inherit all properties (e.g. data slots, methods)
from their existing parent classes
setClass("myclass1", representation(a = "character", b = "character"))
setClass("myclass2", representation(c = "numeric", d = "numeric"))
setClass("myclass3", contains=c("myclass1", "myclass2"))
new("myclass3", a=letters[1:4], b=letters[1:4], c=1:4, d=4:1)
An object of class "myclass3"
Slot "a":
[1] "a" "b" "c" "d"
Slot "b":
[1] "a" "b" "c" "d"
Slot "c":
[1] 1 2 3 4
Slot "d":
[1] 4 3 2 1
getClass("myclass1")
Class "myclass1" [in ".GlobalEnv"]
Slots:
Name: a b
Class: character character
getClass("myclass2")
Class "myclass2" [in ".GlobalEnv"]
Slots:
Name: c d
Class: numeric numeric
getClass("myclass3")
Class "myclass3" [in ".GlobalEnv"]
Slots:
Name: a b c d
Class: character character numeric numeric
(G) Virtual classes are constructs for which no instances will be or can be created. They are used
to link together classes which may have distinct representations (e.g. cannot inherit from each
other) but for which one wants to provide similar functionality. Often it is desired to create a
virtual class and to then have several other classes extend it. Virtual classes can be defined by
leaving out the representation argument or including the class VIRTUAL:
setClass("myVclass")
setClass("myVclass", representation(a = "character", "VIRTUAL"))
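How a virtual class links otherwise unrelated classes can be sketched with a small made-up hierarchy. All class and generic names below (Shape, Circle, Square, area, describe) are assumptions for this illustration: a single method defined for the virtual class serves every class that extends it.

```r
library(methods)
setClass("Shape", representation("VIRTUAL"))                         # hypothetical virtual parent class
setClass("Circle", contains="Shape", representation(r="numeric"))
setClass("Square", contains="Shape", representation(side="numeric"))
setGeneric("area", function(obj) standardGeneric("area"))
setMethod("area", "Circle", function(obj) pi * obj@r^2)
setMethod("area", "Square", function(obj) obj@side^2)
setGeneric("describe", function(obj) standardGeneric("describe"))
## One method defined on the virtual class is shared by all classes extending it
setMethod("describe", "Shape", function(obj) paste(class(obj), "with area", round(area(obj), 2)))
describe(new("Square", side=2))     # returns "Square with area 4"
describe(new("Circle", r=1))        # returns "Circle with area 3.14"
isVirtualClass("Shape")             # returns TRUE
```

Calling new("Shape") fails with an error, since no instances can be created from a virtual class.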
(H) Functions to introspect classes
getClass("myclass")
getSlots("myclass")
slotNames("myclass")
extends("myclass2")
(F) Define a graphical plotting function and allow user to access it with generic plot function
setMethod(f="plot", signature="myclass", definition=function(x, ...) {
barplot(as.matrix(acc(x)), ...)
})
plot(myobj)
Building R Packages
To get familiar with the structure, building and submission process of R packages, users should
carefully read the documentation on this topic available on these sites:
Writing R Extensions, R web site
R Packages, by Hadley Wickham
R Package Primer, by Karl Broman
Package Guidelines, Bioconductor
Advanced R Programming Class, Bioconductor
(B) Once a package skeleton is available one can build the package from the command-line
(Linux/OS X):
$ R CMD build mypackage
This will create a tarball of the package with its version number encoded in the file name,
e.g.: mypackage_1.0.tar.gz.
Subsequently, the package tarball needs to be checked for errors with:
$ R CMD check mypackage_1.0.tar.gz
All issues in a package's source code and documentation should be addressed until R CMD
check returns no error or warning messages anymore.
Linux:
install.packages("mypackage_1.0.tar.gz", repos=NULL)
OS X:
install.packages("mypackage_1.0.tar.gz", repos=NULL, type="source")
Windows requires a zip archive for installing R packages, which can be most conveniently
created from the command-line (Linux/OS X) by installing the package in a local directory (here
tempdir) and then creating a zip archive from the installed package directory:
$ mkdir tempdir
$ R CMD INSTALL -l tempdir mypackage_1.0.tar.gz
$ cd tempdir
$ zip -r mypackage mypackage
Additional *.Rd help templates can be generated with the prompt*() functions. Existing *.Rd files
can be rendered and checked with the tools package like this:
library(tools)
Rd2txt("./mypackage/man/myfct.Rd") # renders *.Rd files as they look in final help pages
checkRd("./mypackage/man/myfct.Rd") # checks *.Rd help file for problems
The best way of sharing an R package with the community is to submit it to one of the main R
package repositories, such as CRAN or Bioconductor. The details about the submission process
are given on the corresponding repository submission pages:
Submitting to Bioconductor (guidelines, submission, svn control, build/checks
release, build/checks devel)
Submitting to CRAN
R Programming Exercises
Exercise Slides
[ Slides ] [ Exercises ] [ Additional Exercises ]
Download one of the above exercise files, then start editing this R source file with a programming
text editor, such as Vim, Emacs or one of the R GUI text editors. Here is the HTML version of
the code with syntax coloring.
Sample Scripts
Batch Operations on Many Files
## (1) Start R from an empty test directory
## (2) Create some files as sample data
for(i in month.name) {
    mydf <- data.frame(Month=month.name, Rain=runif(12, min=10, max=100),
                       Evap=runif(12, min=1000, max=2000))
    write.table(mydf, file=paste(i, ".infile", sep=""), quote=FALSE,
                row.names=FALSE, sep="\t")
}
## (3) Import created files, perform calculations and export to renamed files
files <- list.files(pattern="\\.infile$")
## Start for loop with numeric or character vector; numeric vector is often more flexible
for(i in seq(along=files)) {
    x <- read.table(files[i], header=TRUE, row.names=1, comment.char="A", sep="\t")
    ## Calculates row-wise sum and mean for each data frame
    x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))
    assign(files[i], x) # generates data frame object and names it after the content of files[i]
    print(files[i], quote=FALSE) # prints loop iteration to screen to check its status
    write.table(x, paste(files[i], ".out", sep=""), quote=FALSE, sep="\t", col.names=NA)
}
## (4) Same as above, but file naming by index data frame. This way one can
## organize file names by external table.
name_df <- data.frame(Old_name=sort(files), New_name=sort(month.abb))
for(i in seq(along=name_df[,1])) {
    x <- read.table(as.vector(name_df[i,1]), header=TRUE, row.names=1,
                    comment.char="A", sep="\t")
    x <- data.frame(x, sum=apply(x, 1, sum), mean=apply(x, 1, mean))
    ## Generates data frame object and names it after the entry in row i of column 2
    assign(as.vector(name_df[i,2]), x)
    print(as.vector(name_df[i,1]), quote=FALSE)
    write.table(x, paste(as.vector(name_df[i,2]), ".out", sep=""),
                quote=FALSE, sep="\t", col.names=NA)
}
## (6) Write the above code into a text file and execute it with the commands
## 'source' and 'BATCH'.
source("my_script.R") # execute from R console
$ R CMD BATCH my_script.R # execute from shell
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/lsArray.R")
To demonstrate the utilities of the script, users can simply execute it from R with the following source
command:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sequenceAnalysis.txt")
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/patternSearch.R")
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/wordFinder.R")
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/translateDNA.R")
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/sdfSubset.R")
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/BibTex.R")
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R")
## Write each character of sequence into separate vector field and reverse its order
my_split <- strsplit(as.character(z1[1,2]),"")
my_rev <- rev(my_split[[1]])
paste(my_rev, collapse="")
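The lines above depend on an object z1 that is not defined in this section. A self-contained sketch of the same reversal, using a made-up sequence string in its place, looks as follows; the helper rev_string is an assumption added here to apply the same steps to several strings at once.

```r
myseq <- "GATTACA"                             # hypothetical sequence, stands in for z1[1,2]
my_split <- strsplit(myseq, "")                # list containing one vector of single characters
my_rev <- rev(my_split[[1]])                   # reverses the order of the characters
paste(my_rev, collapse="")                     # returns "ACATTAG"
## Hypothetical helper applying the same steps to every element of a character vector
rev_string <- function(x) sapply(strsplit(x, ""), function(v) paste(rev(v), collapse=""))
rev_string(c("GATTACA", "ATG"))                # returns "ACATTAG" "GTA"
```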