Indexing Exercises
Charles DiMaggio
applications of epidemiologic methods II
Spring 2014
Indexing is the key to working with and manipulating R data. There are three ways to index
data in R:
• position
• logical vector
• name
Run the following to see an example of each type of indexing.
> x <- c(chol = 234, sbp = 148, dbp = 78, age = 54)
> x
> x[1] # by position
> x[x>150] # by logical
> x["chol"] # by name
You can use indexing to replace or change a data entry.
> x[1] <- 250 #by position
> x[x<100] <- NA # by logical
> x["sbp"] <- 150 # by name
> x
Let’s look at the three approaches to indexing in a bit more detail.
1 indexing vectors
1.1 by position
> x<-1:40
> x[11] #only the 11th element
> x[-11] #exclude the 11th element
> x[11:20] #members 11 to 20
> x[-(11:20)] # all but members 11 to 20
1.2 by logical
> mean(age) ; sd(age)
> agecat <- age # make copy
> # replace elements of agecat with a string for each category
> agecat[age<15] <- "<15" # creating character vector
> agecat[age>=15 & age<25] <- "15-24"
> agecat[age>=25 & age<45] <- "25-44"
> agecat[age>=45 & age<65] <- "45-64"
> agecat[age>=65] <- "65+"
> table(agecat) # get freqs
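The recoding above assumes a numeric vector named age already exists in the workspace. As a minimal self-contained sketch, the version below builds a small made-up age vector (the values are invented for illustration and are not course data) and repeats the same logical-index replacement. The cut() call at the end is one built-in alternative that should produce the same groups.

```r
# Hypothetical ages, invented purely for illustration
age <- c(4, 17, 23, 31, 44, 45, 58, 64, 65, 80)

agecat <- as.character(age)            # character copy to hold the labels
agecat[age < 15]             <- "<15"
agecat[age >= 15 & age < 25] <- "15-24"
agecat[age >= 25 & age < 45] <- "25-44"
agecat[age >= 45 & age < 65] <- "45-64"
agecat[age >= 65]            <- "65+"
table(agecat)                          # frequencies per category

# Alternative: cut() with left-closed intervals [a, b)
agecat2 <- cut(age, breaks = c(0, 15, 25, 45, 65, Inf), right = FALSE,
               labels = c("<15", "15-24", "25-44", "45-64", "65+"))
table(agecat2)
```

Note that every condition tests age, not agecat, so the replacements do not interfere with one another once agecat has become a character vector.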
The extra comma after 3 leaves the column index empty, which tells R to return every column of the rows in x for which the 1st column is <3.
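As a minimal sketch of this kind of row selection, assuming x is a small numeric matrix (the one below is made up for illustration):

```r
# Hypothetical 4 x 3 matrix, filled column by column
x <- matrix(1:12, nrow = 4, ncol = 3)

keep <- x[, 1] < 3      # logical vector: TRUE for rows whose 1st column is < 3
rows <- x[keep, ]       # comma with nothing after it: keep every column
col2 <- x[keep, 2]      # same rows, but only the 2nd column
rows
col2
```

Storing the logical vector in keep first makes it easy to inspect which rows will be selected before using it as an index.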
The functions lower.tri() and upper.tri() return a logical matrix marking the positions below or
above the diagonal of a matrix, which can then be used as an index.
> m2<-matrix(round(rnorm(9,50,5)),3,3) # 9 values fill a 3x3 matrix
> m2
> lower.tri(m2)
> upper.tri(m2)
> m2[lower.tri(m2)]
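Because m2 is filled with random values, the result changes on every run. A minimal sketch with a fixed matrix (m is a made-up example) may make the behavior of lower.tri() and upper.tri() easier to follow:

```r
# Fixed 3 x 3 matrix so the result is reproducible
m <- matrix(1:9, 3, 3)

below      <- m[lower.tri(m)]               # elements strictly below the diagonal
below_diag <- m[lower.tri(m, diag = TRUE)]  # diag = TRUE also includes the diagonal
below
below_diag

# The same logical matrix can be used for replacement
m[upper.tri(m)] <- 0                        # zero out everything above the diagonal
m
```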
3 indexing lists
Indexing lists can sometimes be challenging. Recall the bracket notation for lists, where
double brackets refer to the "bin" of like objects, and a following single bracket refers to the
contents of that bin.
> l<- list(1:5, matrix(1:4,2,2),
+ c("John Snow", "William Farr"))
1. by position
> l[[1]]
> l[[2]][2,1]
> l[[3]][2]
2. logical
> char <- sapply(l, is.character)
> char
> epi.folk<-l[char]
> epi.folk
Indexing lists comes in handy when working with the results of statistical models, which
frequently return results in the form of lists. Fortunately, most package authors return the
results as named lists.
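As a minimal sketch of indexing a named list, the example below rebuilds the list used earlier with names attached (the names ids, tab and people are hypothetical, chosen only for illustration):

```r
# Hypothetical named list, invented for illustration
study <- list(ids    = 1:5,
              tab    = matrix(1:4, 2, 2),
              people = c("John Snow", "William Farr"))

names(study)              # component names
study$people              # by name with $
study[["people"]][2]      # by name with [[ ]], then the 2nd element of that component
study["people"]           # a single bracket returns a list of length 1, not the vector
study$tab[2, 1]           # a name, then matrix indexing inside the component
```

The difference between study[["people"]] and study["people"] mirrors the "bin" versus "contents" distinction: the double bracket extracts the component itself, while the single bracket returns a shorter list.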
Work through the following conditional logistic regression of abortion and infertility to see
an example of extracting list elements from the results of a model.
> data(infert)
> library(survival) # package with clogit()
> mod1 <- clogit(case ~ spontaneous + induced +
+ strata(stratum), data = infert)
> mod1 # default results (7x risk with spontaneous abortion, 4x with induced)
> str(mod1)
> names(mod1) #structure, names
> mod1$coefficients # name to index result (list element); mod1$coeff also works by partial matching
> summod1<-summary(mod1) #more detailed results
> names(summod1) #detailed list components
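Once the component names are known, individual results can be pulled out by name. The sketch below refits the same model so that it runs on its own, and assumes the survival package is installed. The components coefficients and conf.int are the ones listed by names(summod1).

```r
library(survival)                        # provides clogit()
data(infert)
mod1 <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)

coef(mod1)                               # log odds ratios, same as mod1$coefficients
or <- exp(coef(mod1))                    # odds ratios
or

summod1 <- summary(mod1)
summod1$conf.int                         # odds ratios with 95% confidence limits
summod1$coefficients["spontaneous", ]    # one row of the coefficient table, by name
```

Indexing the coefficient table by row name rather than by position keeps the code correct even if terms are added to or removed from the model formula.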
4 indexing dataframes
Data frames can (generally) be indexed like matrices, with the added advantage of being
able to use column (variable) names.
Run through this code to get a sense of how dataframes can be indexed.
> data(infert)
1. position
> infert[1:4, 1:2]
> infert[1:4, 2] <- c(NA, 45, NA, 23)
> infert[1:4, 1:2]
2. name
> names(infert)
> infert[1:4, c("education", "age")]
> infert[1:4, c("age")] <- c(NA, 45, NA, 23)
> infert[1:4, c("education", "age")]
3. logical
> table(infert$parity)
> # change values of 5 or 6 to missing
> infert$parity[infert$parity==5 | infert$parity==6] <- NA
> table(infert$parity)
> table(infert$parity, exclude=NULL) # include the NA count
In the following perhaps more realistic example you will read in a set of anonymized hospital
discharge data, and then index it in various ways.
> sparcs<-read.csv(file=paste0("http://www.columbia.edu/~cjd11/charles_dimaggio/",
+ "DIRE/resources/R/sparcsShort.csv"), stringsAsFactors=FALSE)
• index rows
> brooklyn<-sparcs[sparcs$county=="59",]
> nyc<- sparcs$county=="58"| sparcs$county=="59"|
+ sparcs$county=="60"| sparcs$county=="61"| sparcs$county=="62"
> nyc.sparcs<-sparcs[nyc,]
• index columns
> dxs<-sparcs[,"pdx"]
> vars<-c("date", "pdx", "disp")
> my.vars<-sparcs[,vars]
• index rows and columns
> sparcs2<-sparcs[nyc,vars]
• variables to include
> brooklyn.sparcs<-subset(sparcs, county=="59",
+ select=c(date, pdx,disp))
• range of values
> nyc.sparcs<-subset(sparcs, county %in% 59:62,
+ select=c(county, pdx,disp))
• excluding columns
> nyc.sparcs<-subset(sparcs, county %in% 59:62,
+ select=-c(county, pdx,disp))
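When selecting rows whose value falls in a set, as in the county filters above, note that == compares element by element and recycles the shorter vector, whereas %in% tests membership. A minimal sketch on a made-up data frame (toy and its values are hypothetical, standing in for sparcs) shows the difference:

```r
# Hypothetical stand-in for the discharge data; all values are invented
toy <- data.frame(county = c(58, 59, 60, 61, 62, 59, 30, 61),
                  pdx    = c("A", "B", "C", "D", "E", "F", "G", "H"),
                  disp   = c(1, 1, 2, 1, 3, 2, 1, 2))

by_equal <- toy$county == 59:62      # element-by-element comparison with recycling
by_in    <- toy$county %in% 59:62    # TRUE wherever county is any of 59, 60, 61, 62
by_equal
by_in

keep_cols <- subset(toy, county %in% 59:62, select = c(county, pdx))  # named columns only
drop_cols <- subset(toy, county %in% 59:62, select = -disp)           # all columns but disp
keep_cols
drop_cols
```

In subset(), the second argument chooses rows and select chooses columns, so a minus sign inside select drops columns and leaves the row selection unchanged.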