R Programming
http://www.r-project.org/ http://cran.r-project.org/
Hung Chen
Outline
Introduction:
Historical development: S, S-plus
Capability
Statistical Analysis
Grouping, loops and conditional execution
Function
Programming
R, S and S-plus
S: an interactive environment for data analysis developed at Bell Laboratories since 1976
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers
Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle WA. Product name: S-plus.
Implementation languages: C, Fortran.
See: http://cm.bell-labs.com/cm/ms/departments/sia/S/history.html
R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland, New Zealand, during the 1990s.
Since 1997: international R-core team of ca. 15 people with access to a common CVS archive.
Introduction
R is "GNU S": a language and environment for data manipulation, calculation and graphical display.
R is similar to the award-winning S system, which was developed at Bell Laboratories by John Chambers et al. It provides:
o a suite of operators for calculations on arrays, in particular matrices,
o a large, coherent, integrated collection of intermediate tools for interactive data analysis,
o graphical facilities for data analysis and display either directly at the computer or on hardcopy,
o a well developed programming language which includes conditionals, loops, user defined recursive functions and input and output facilities.
R and statistics
o Packaging: a crucial infrastructure to efficiently produce, load and keep consistent software libraries from (many) different sources / authors
o Statistics: most packages deal with statistics and data analysis
o State of the art: many statistical researchers provide their methods as R packages
R also has a large set of functions which provide a flexible graphical environment for creating various kinds of data presentations.
References
For R,
The basic reference is The New S Language: A Programming Environment for Data Analysis and Graphics by Richard A. Becker, John M. Chambers and Allan R. Wilks (the Blue Book).
The new features of the 1991 release of S (S version 3) are covered in Statistical Models in S edited by John M. Chambers and Trevor J. Hastie (the White Book).
Classical and modern statistical techniques have been implemented. Some of these are built into the base R environment. Many are supplied as packages. There are about 8 packages supplied with R (called standard packages) and many more are available through the CRAN family of Internet sites (via http://cran.r-project.org).
All the R functions have been documented in the form of help pages in an output-independent form which can be used to create versions for HTML, LaTeX, text etc.
o The document An Introduction to R provides a more user-friendly starting point.
o An R Language Definition manual
o More specialized manuals on data import/export and extending R.
R as a calculator
> log2(32)
[1] 5
[Figure: plot of sin(seq(0, 2 * pi, length = 100)) against Index (1 to 100); y axis from -1.0 to 1.0]
Object orientation
Primitive (or: atomic) data types in R are:
o numeric (integer, double, complex)
o character
o logical
o function
Out of these, vectors, arrays and lists can be built.
Object orientation
Object: a collection of atomic variables and/or other objects that belong together
Example: a microarray experiment
o probe intensities
o patient data (tissue location, diagnosis, follow-up)
o gene data (sequence, IDs, annotation)
Parlance:
o class: the "abstract" definition of it
o object: a concrete instance
o method: other word for "function"
o slot: a component of an object
Object orientation
Advantages:
o Encapsulation (can use the objects and methods someone else has written without having to care about the internals)
o Generic functions (e.g. plot, print)
o Inheritance (hierarchical organization of complexity)
Caveat: overcomplicated, baroque program architecture
variables
> a = 49
> sqrt(a)
[1] 7
> a = "The dog ate my homework"
> sub("dog","cat",a)
[1] "The cat ate my homework"
> a = (1+1==3)
> a
[1] FALSE
Lists
vector: an ordered collection of data of the same type.
> a = c(7,5,1)
> a[2]
[1] 5
list: an ordered collection of data of arbitrary types.
> doe = list(name="john",age=28,married=F)
> doe$name
[1] "john"
> doe$age
[1] 28
Typically, vector elements are accessed by their index (an integer), list elements by their name (a character string). But both types support both access methods.
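The statement that both types support both access methods can be illustrated with a short sketch. The element names given to the vector (first, second, third) are made up for this illustration.

```r
# A named vector: elements can be reached by position or by name.
# The names are hypothetical, chosen only for this example.
v <- c(first = 7, second = 5, third = 1)
v[2]           # access by index
v["second"]    # access by name

# A list: $name, [["name"]] and [[index]] all return the element itself.
doe <- list(name = "john", age = 28, married = FALSE)
doe$age
doe[["age"]]
doe[[2]]
```

Note that single brackets on a list, e.g. doe[2], return a list of length one rather than the element itself.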
Data frames
data frame: represents the typical data table that researchers come up with, like a spreadsheet. It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.
Example:
> a
      localisation tumorsize
XX348     proximal       6.3
XX234       distal       8.0
XX987     proximal      10.0
Factors
A character string can contain arbitrary text. Sometimes it is useful to use a limited vocabulary, with a small number of allowed words. A factor is a variable that can only take such a limited number of values, which are called levels.
> a
[1] Kolon(Rektum)       Magen               Magen
[4] Magen               Magen               Retroperitoneal
[7] Magen               Magen(retrogastral) Magen
Levels: Kolon(Rektum) Magen Magen(retrogastral) Retroperitoneal
> class(a)
[1] "factor"
> as.character(a)
[1] "Kolon(Rektum)"       "Magen"               "Magen"
[4] "Magen"               "Magen"               "Retroperitoneal"
[7] "Magen"               "Magen(retrogastral)" "Magen"
> as.integer(a)
[1] 1 2 2 2 2 4 2 3 2
> as.integer(as.character(a))
[1] NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion
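A minimal sketch of how such a factor might be created from a character vector. The vector site below is a made-up stand-in, not the data shown above.

```r
# Hypothetical tumour sites, used only for illustration
site <- c("Magen", "Kolon", "Magen", "Retroperitoneal")

f <- factor(site)
levels(f)        # the distinct values, sorted alphabetically
table(f)         # number of observations per level
as.integer(f)    # internal integer codes (positions in levels(f)), not the labels
```

The integer codes index into levels(f), which is why as.integer() on a factor returns small integers rather than anything derived from the text of the labels.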
Subsetting
Individual elements of a vector, matrix, array or data frame are accessed with "[ ]" by specifying their index, or their name
> a
      localisation tumorsize progress
XX348     proximal       6.3        0
XX234       distal       8.0        1
XX987     proximal      10.0        0
> a[3, 2]
[1] 10
> a["XX987", "tumorsize"]
[1] 10
> a["XX987",]
      localisation tumorsize progress
XX987     proximal        10        0
Subsetting
> a
      localisation tumorsize progress
XX348     proximal       6.3        0
XX234       distal       8.0        1
XX987     proximal      10.0        0
> a[c(1,3),]                          # subset rows by a vector of indices
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
> a[c(T,F,T),]                        # subset rows by a logical vector
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
> a$localisation                      # subset a column
[1] "proximal" "distal"   "proximal"
> a$localisation=="proximal"          # comparison resulting in logical vector
[1]  TRUE FALSE  TRUE
> a[ a$localisation=="proximal", ]    # subset the selected rows
      localisation tumorsize progress
XX348     proximal       6.3        0
XX987     proximal      10.0        0
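To try the subsetting operations, the example table first has to exist as an object. The slides do not show how a was created; one possible way, assuming the values printed above, is:

```r
# Rebuild the example table (construction is an assumption; values are from the slides)
a <- data.frame(
  localisation = c("proximal", "distal", "proximal"),
  tumorsize    = c(6.3, 8.0, 10.0),
  progress     = c(0, 1, 0),
  row.names    = c("XX348", "XX234", "XX987"),
  stringsAsFactors = FALSE   # keep localisation as character, as in the printout
)

a[3, 2]                               # by row and column index
a["XX987", "tumorsize"]               # by row and column name
a[a$localisation == "proximal", ]     # rows selected by a logical vector

# subset() is an alternative that avoids repeating the data frame name
subset(a, tumorsize > 7, select = c(localisation, tumorsize))
```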
Resources
A package specification allows the production of loadable modules for specific purposes, and several contributed packages are made available through the CRAN sites.
CRAN and R homepage:
o http://www.r-project.org/ is R's central homepage, giving information on the R project and everything related to it.
o http://cran.r-project.org/ acts as the download area, carrying the software itself, extension packages and PDF manuals.
Getting help
Details about a specific command whose name you know (input arguments, options, algorithm, results):
> ? t.test
or
> help(t.test)
Getting help
o HTML search engine
o Search for topics with regular expressions: help.search
Probability distributions
Prefix the R name of the distribution with one of the following letters:
o Cumulative distribution function P(X ≤ x): p for the CDF
o Probability density function: d for the density
o Quantile function (given q, the smallest x such that P(X ≤ x) ≥ q): q for the quantile
o Simulate from the distribution: r
Distribution        R name    additional arguments
beta                beta      shape1, shape2, ncp
binomial            binom     size, prob
Cauchy              cauchy    location, scale
chi-squared         chisq     df, ncp
exponential         exp       rate
F                   f         df1, df2, ncp
gamma               gamma     shape, scale
geometric           geom      prob
hypergeometric      hyper     m, n, k
log-normal          lnorm     meanlog, sdlog
logistic            logis
negative binomial   nbinom
normal              norm
Poisson             pois
Student's t         t
uniform             unif
Weibull             weibull
Wilcoxon            wilcox
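As a short illustration of the naming scheme, the four prefixes can be combined with norm and binom as follows. The particular arguments and the seed are arbitrary choices for this sketch.

```r
dnorm(0)                           # density of N(0, 1) at 0, i.e. 1 / sqrt(2 * pi)
pnorm(1.96)                        # P(X <= 1.96) for X ~ N(0, 1), roughly 0.975
qnorm(0.975)                       # the matching quantile, roughly 1.96

set.seed(1)                        # arbitrary seed, only for reproducibility
x <- rnorm(5, mean = 10, sd = 2)   # five random draws from N(10, 2^2)

pbinom(3, size = 10, prob = 0.5)   # P(X <= 3) for X ~ Binomial(10, 0.5)
```

The p and q functions are inverses of each other for continuous distributions, so qnorm(pnorm(z)) returns z up to rounding error.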
Control statements
if statements
The language has available a conditional construction of the form
    if (expr_1) expr_2 else expr_3
where expr_1 must evaluate to a single logical value and the result of the entire expression is then evident.
There is also a vectorized version of the if/else construct, the ifelse function. This has the form ifelse(condition, a, b) and returns a vector with element a[i] where condition[i] is true and b[i] otherwise.
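The difference between the two constructs can be seen in a small sketch; the vector x and the labels are invented for the example.

```r
x <- c(-2, 0, 3)

# if/else takes ONE logical value and chooses between two expressions
if (length(x) > 2) {
  msg <- "long"
} else {
  msg <- "short"
}

# ifelse() is vectorized: it evaluates the condition element by element
sign_label <- ifelse(x < 0, "negative", "non-negative")
sign_label
```

Using a condition of length greater than one inside if() is an error in recent versions of R, which is the usual reason to reach for ifelse() instead.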
Repetitive execution
for loops, repeat and while
    for (name in expr_1) expr_2
where name is the loop variable. expr_1 is a vector expression (often a sequence like 1:20), and expr_2 is often a grouped expression with its sub-expressions written in terms of the dummy name. expr_2 is repeatedly evaluated as name ranges through the values in the vector result of expr_1.
Other looping facilities include the
    repeat expr
statement and the
    while (condition) expr
statement.
The break statement can be used to terminate any loop, possibly abnormally. This is the only way to terminate repeat loops.
The next statement can be used to discontinue one particular cycle and skip to the next.
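A minimal sketch of the three loop forms together with next and break. The numeric tasks are arbitrary and serve only to show the control flow.

```r
# for with next: sum the odd numbers between 1 and 10
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next     # skip even values and continue with the next i
  total <- total + i
}

# while: find the smallest n with 2^n >= 1000
n <- 0
while (2^n < 1000) {
  n <- n + 1
}

# repeat with break: find the smallest k with k^2 > 50
k <- 0
repeat {
  k <- k + 1
  if (k^2 > 50) break       # the only way out of a repeat loop
}

c(total = total, n = n, k = k)
```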
Branching
if (logical expression) {
    statements
} else {
    alternative statements
}
The else branch is optional.
Loops
When the same or similar tasks need to be performed multiple times; for all elements of a list; for all columns of an array; etc.
o Monte Carlo Simulation
o Cross-Validation (delete-one, etc.)
lapply(li, function)
To each element of the list li, the function function is applied. The result is a list whose elements are the individual function results.
> li = list("klaus","martin","georg")
> lapply(li, toupper)
[[1]]
[1] "KLAUS"

[[2]]
[1] "MARTIN"

[[3]]
[1] "GEORG"
apply
apply(arr, margin, fct)
Apply the function fct along some dimensions of the array arr, according to margin, and return a vector or array of the appropriate size.
> x
     [,1] [,2] [,3]
[1,]    5    7    0
[2,]    7    9    8
[3,]    4    6    7
[4,]    6    3    5
> apply(x, 1, sum)
[1] 12 24 17 14
> apply(x, 2, sum)
[1] 22 25 20
Storing data
Every R object can be stored into and restored from a file with the commands "save" and "load". This uses the XDR (external data representation) standard of Sun Microsystems and others, and is portable between MS Windows, Unix and Mac.
> save(x, file="x.Rdata")
> load("x.Rdata")
Statistical models in R
Regression analysis
a linear regression model with independent homoscedastic errors
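Such a model, y_i = b0 + b1 * x_i + e_i with independent errors of constant variance, is fitted in R with lm(). The sketch below uses simulated data; the true coefficients (2 and 0.5), the sample size and the seed are assumptions made only for the illustration.

```r
set.seed(42)                          # arbitrary seed for reproducibility
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # independent errors with constant variance

fit <- lm(y ~ x)                      # least squares fit of y on x
coef(fit)                             # estimated intercept and slope
summary(fit)$sigma                    # estimated error standard deviation

# Residuals against fitted values: a band of roughly constant width
# is consistent with homoscedastic errors
plot(fitted(fit), residuals(fit), xlab = "fitted", ylab = "residuals")
abline(h = 0, lty = 2)
```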
One-Way ANOVA
The model
Given a factor $\alpha$ occurring at $i = 1, \dots, I$ levels, with $j = 1, \dots, J_i$ observations per level, we use the model
    $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, \dots, I, \quad j = 1, \dots, J_i$
Not all the parameters are identifiable and some restriction is necessary:
o Set $\mu = 0$ and use I different dummy variables.
o Set $\alpha_1 = 0$; this corresponds to treatment contrasts.
o Set $\sum_i J_i \alpha_i = 0$; this ensures orthogonality.
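The restrictions can be tried out in R on simulated data. In this sketch the group labels, group sizes and means are invented. Note also that contr.sum imposes the unweighted constraint that the effects sum to zero, which coincides with the weighted constraint above only when all J_i are equal; it is shown here as the closest built-in option, not as an exact implementation of the third restriction.

```r
set.seed(1)                                     # arbitrary seed
g     <- factor(rep(c("A", "B", "C"), times = c(4, 5, 6)))
means <- c(A = 10, B = 12, C = 15)              # hypothetical group means
y     <- unname(means[as.character(g)]) + rnorm(length(g))

fit_nomu  <- lm(y ~ g - 1)    # mu = 0: one dummy variable per level
fit_treat <- lm(y ~ g)        # alpha_1 = 0: treatment contrasts (R's default)
fit_sum   <- lm(y ~ g, contrasts = list(g = "contr.sum"))  # unweighted sum-to-zero

coef(fit_nomu)                # the three group means
coef(fit_treat)               # mean of A, then differences B - A and C - A
coef(fit_sum)

anova(fit_treat)              # the F test does not depend on the restriction chosen
```

All three parameterizations give the same fitted values; only the interpretation of the coefficients changes.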
Two-Way Anova
The model: $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$.
We have two factors, $\alpha$ at I levels and $\beta$ at J levels. Let $n_{ij}$ be the number of observations at level i of $\alpha$ and level j of $\beta$ and let those observations be $y_{ij1}, y_{ij2}, \dots$. A complete layout has $n_{ij} \ge 1$ for all i, j.
The interaction effect $(\alpha\beta)_{ij}$ is interpreted as that part of the mean response not attributable to the additive effect of $\alpha_i$ and $\beta_j$.
For example, you may enjoy strawberries and cream individually, but the combination is superior. In contrast, you may like fish and ice cream but not together.
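A two-way model with interaction can be fitted with aov() or lm() using the formula y ~ A * B. The sketch below uses a simulated, balanced 3 x 4 layout with 4 replicates per cell; the factor names, level names, effect sizes and seed are all invented for the illustration and are not real measurements.

```r
set.seed(2)                                   # arbitrary seed
d <- expand.grid(
  rep = 1:4,                                  # 4 replicates per cell
  A   = factor(c("a1", "a2", "a3")),          # first factor, I = 3 levels
  B   = factor(c("b1", "b2", "b3", "b4"))     # second factor, J = 4 levels
)
# Simulated response: additive effects of A and B plus noise, no true interaction
d$y <- 5 + as.integer(d$A) + 0.5 * as.integer(d$B) + rnorm(nrow(d))

fit <- aov(y ~ A * B, data = d)   # A * B expands to A + B + A:B
summary(fit)                      # rows for A, B, the interaction A:B and residuals

# Roughly parallel profiles suggest little interaction
with(d, interaction.plot(A, B, y))
```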
As part of an investigation of toxic agents, 48 rats were allocated to 3 poisons (I, II, III) and 4 treatments (A, B, C, D).
The response was survival time in tens of hours. The Data:
Bootstrap
The bootstrap method mirrors the simulation method but uses quantities we do know.
Instead of sampling from the population distribution which we do not know in practice, we resample from the data itself.
Difficulty: $\beta$ is unknown and the distribution of $\varepsilon$ is unknown.
Solution: $\beta$ is replaced by its good estimate b and the distribution of $\varepsilon$ is replaced by the empirical distribution of the residuals $e_1, \dots, e_n$.
1. Generate e* by sampling with replacement from $e_1, \dots, e_n$.
2. Form y* = X b + e*.
3. Compute b* from (X, y*).
Implementation
How do we take a sample of residuals with replacement?
sample() is good for generating random samples of indices: sample(10, replace = TRUE) leads to, for example, 7 9 9 2 5 7 4 1 8 9
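Putting the three bootstrap steps together with sample() might look like the following sketch. The data are simulated (true coefficients 1 and 2, n = 50), and the number of bootstrap replications B = 1000 is an arbitrary choice; none of these values come from the slides.

```r
set.seed(123)                               # arbitrary seed
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)         # simulated data for illustration

fit <- lm(y ~ x)
X   <- model.matrix(fit)                    # design matrix, including the intercept
b   <- coef(fit)                            # estimate b
e   <- residuals(fit)                       # residuals e_1, ..., e_n

B     <- 1000                               # number of bootstrap replications (assumed)
bstar <- matrix(NA_real_, nrow = B, ncol = length(b),
                dimnames = list(NULL, names(b)))

for (i in seq_len(B)) {
  estar <- sample(e, replace = TRUE)            # 1. resample residuals with replacement
  ystar <- drop(X %*% b) + estar                # 2. form y* = X b + e*
  bstar[i, ] <- lm.fit(X, ystar)$coefficients   # 3. compute b* from (X, y*)
}

apply(bstar, 2, sd)                                  # bootstrap standard errors
apply(bstar, 2, quantile, probs = c(0.025, 0.975))   # percentile intervals
```

lm.fit() is used inside the loop because it works directly on the design matrix and skips the formula handling of lm(), which keeps the loop reasonably fast.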
length(lottery.number)   # 254
breaks <- 100*(0:10); breaks[1] <- -1
hist(lottery.number, breaks = breaks)
abline(254/10, 0)   # expected count per bin (goodness-of-fit test)
50 $500 1/1000
boxplot(lottery.payoff, main = "NJ Pick-it Lottery (5/22/75-3/16/76)", sub = "Payoff")
lottery.label <- "NJ Pick-it Lottery (5/22/75-3/16/76)"
hist(lottery.payoff, main = lottery.label)
Data Analysis
$500 ?
outliers?
min(lottery.payoff)   # 83
lottery.number[lottery.payoff == min(lottery.payoff)]   # 123
# comparison operators: <, >, <=, >=, ==, !=
max(lottery.payoff)   # 869.5
lottery.number[lottery.payoff == max(lottery.payoff)]   # 499
combination bets
plot(a[1,], a[2,], xlab = "lottery.number", ylab = "lottery.payoff", main = "Payoff >=500")
boxplot(split(lottery.payoff, lottery.number %/% 100), sub = "Leading Digit of Winning Numbers", ylab = "Payoff")
qqplot(lottery.payoff, lottery3.payoff); abline(0,1)
boxplot(lottery.payoff, lottery2.payoff, lottery3.payoff)
$500
rbind(lottery2.number[lottery2.payoff >= 500], lottery2.payoff[lottery2.payoff >= 500])
rbind(lottery3.number[lottery3.payoff >= 500], lottery3.payoff[lottery3.payoff >= 500])
lottery2 (10 Nov 1976 to 6 Sep 1977), lottery3 (1 Dec 1980 to 22 Sep 1981)
lottery.number <- scan("c:/lotterynumber.txt")
lottery.payoff <- scan("c:/lotterypayoff.txt")
lottery2 <- scan("c:/lottery2.txt")
lottery2 <- matrix(lottery2, byrow = FALSE, ncol = 2)
In R, use help(faithful) to get more information on this data set. Load the data set by data(faithful).
geyser <- matrix(scan("c:/geyser.txt"), byrow = FALSE, ncol = 2)
geyser.waiting <- geyser[,1]; geyser.duration <- geyser[,2]
hist(geyser.waiting)
hist(geyser.waiting,freq=FALSE)
lines(density(geyser.waiting))
plot(density(geyser.waiting))
lines(density(geyser.waiting, bw = 10))
lines(density(geyser.waiting, bw = 1, kernel = "epanechnikov"))
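The same density plots can be reproduced without the external file by using the built-in faithful data set mentioned above. This is a sketch under the assumption that faithful is an acceptable stand-in: it has 272 observations with columns eruptions and waiting, so it is not identical to the 299-row geyser.txt used in the slides.

```r
data(faithful)                       # built-in Old Faithful data, 272 observations
w <- faithful$waiting                # waiting time between eruptions, in minutes

# Histogram on the density scale so that density curves can be overlaid
hist(w, freq = FALSE, main = "Old Faithful waiting times", xlab = "waiting (min)")
lines(density(w))                                          # default bandwidth
lines(density(w, bw = 10), lty = 2)                        # oversmoothed
lines(density(w, bw = 1, kernel = "epanechnikov"), lty = 3) # undersmoothed, other kernel
legend("topleft", lty = 1:3, bty = "n",
       legend = c("default bw", "bw = 10", "bw = 1, Epanechnikov"))
```

A large bandwidth hides the two modes of the waiting times, while a very small one produces a ragged curve; the default usually sits between the two.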
duration <- geyser.duration[1:298]
waiting <- geyser.waiting[2:299]
plot(duration, waiting, xlab = "duration", ylab = "waiting")
plot(density(duration), xlab = "duration", ylab = "density")
plot(density(geyser.waiting), xlab = "waiting", ylab = "density")
# w_t and d_t
plot(geyser.waiting, geyser.duration, xlab = "waiting", ylab = "duration")
Geyser tube: Rinehart (1969; J. Geophy. Res., 566-573)
d_t versus w_{t+1}:
plot(duration, waiting, xlab = "duration", ylab = "waiting")
A: duration d_t
ts.plot(geyser.duration, ylab = "duration")
B: d_{t+1} versus d_t
lag.plot(geyser.duration, 1)
1: Second-Order Markov Chain
Explore Association
data(stackloss)
It is a data frame with 21 observations on 4 variables.
[,1]  'Air Flow'    Flow of cooling air
[,2]  'Water Temp'  Cooling Water Inlet Temperature
[,3]  'Acid Conc.'  Concentration of acid [per 1000, minus 500]
[,4]  'stack.loss'  Stack loss
The data sets 'stack.x', a matrix with the first three (independent) variables of the data frame, and 'stack.loss', the numeric vector giving the fourth (dependent) variable, are provided as well.
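A first look at the association between the four variables might proceed as in the sketch below. The choice of tools (correlation matrix, scatterplot matrix, a linear model with all three predictors) is one reasonable option, not necessarily the analysis intended in the slides.

```r
data(stackloss)                       # built-in data set
str(stackloss)                        # 21 observations on 4 numeric variables

round(cor(stackloss), 2)              # pairwise correlations
pairs(stackloss)                      # scatterplot matrix of all four variables

# Stack loss regressed on the three other variables ("." means all other columns)
fit <- lm(stack.loss ~ ., data = stackloss)
summary(fit)
```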
Explore Association
A boxplot is suitable for showing the relationship between a quantitative and a qualitative variable. The variable test is not quantitative but categorical.
Such variables are also called factors.
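The data set containing the variable test is not shown here, so the sketch below uses the built-in InsectSprays data as a stand-in: count is the quantitative variable and spray is a factor with six levels.

```r
data(InsectSprays)                    # built-in data: insect counts under 6 sprays
is.factor(InsectSprays$spray)         # the grouping variable is a factor
levels(InsectSprays$spray)

# The formula quantitative ~ factor draws one box per level of the factor
boxplot(count ~ spray, data = InsectSprays,
        xlab = "spray", ylab = "insect count")

# Numerical counterpart of the plot: the median of each group
tapply(InsectSprays$count, InsectSprays$spray, median)
```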