Apply Functions
G Asha
2022-08-24
What are apply functions? Apply functions are a family of functions in base R that let you repeatedly
perform an action on multiple chunks of data. An apply function is essentially a loop, but it often requires
less code than an explicit loop and can run faster.
The apply functions that this chapter will address are apply, lapply, sapply, vapply, tapply, and mapply.
There are so many different apply functions because they are meant to operate on different types of data.
#The apply function
First, let’s go over the basic apply function. You can use the help section to get a description of this function.
?apply
The apply function looks like this: apply(X, MARGIN, FUN).
X is an array or matrix (this is the data that you will be performing the function on)
MARGIN specifies whether you want to apply the function across rows (1) or columns (2)
FUN is the function you want to use
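A minimal sketch of how MARGIN changes the result. The matrix m is a hypothetical example, not part of the original chapter:

```r
# Hypothetical 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

# MARGIN = 1: the function is applied to each row
row_totals <- apply(m, 1, sum)

# MARGIN = 2: the function is applied to each column
col_totals <- apply(m, 2, sum)

print(row_totals)  # one value per row
print(col_totals)  # one value per column
```

The result has one entry per row when MARGIN = 1 and one entry per column when MARGIN = 2.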
2.1 apply examples
Here is a 4 × 5 matrix of random integers from a Poisson distribution with mean 1.5:
X <- matrix(rpois(20, 1.5), nrow = 4)
trials <- c("Tr1", "Tr2", "Tr3", "Tr4")
rownames(X) <- trials
#rownames(X,do.NULL=FALSE,prefix="Trial.")
drug.names <- c("aspirin", "paracetamol", "nurofen", "hedex", "placebo")
colnames(X) <- drug.names
X
## [1] 1.25
The variance of the bottom row, calculated over all of the columns (a blank in the second position):
var(X[4,])
## [1] 0.8
There are some special functions for calculating summary statistics on matrices:
rowSums(X)
## Tr1 Tr2 Tr3 Tr4
## 1.6 1.2 1.2 1.4
colSums(X)
## [1] 10 26 42 58 74 90
Note that when apply is used with a summary function such as sum, whether over rows or over columns, the
answer is a vector rather than a matrix.
You can apply functions to the individual elements of the matrix rather than to the margins. The margin
you specify influences only the shape of the resulting matrix.
apply(X,1,sqrt)
X1 <- apply(X,1:2, function(x) x+3)
X1
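To make the point about shape concrete, here is a small sketch on a hypothetical matrix m (chosen so the square roots are whole numbers). It compares the dimensions you get back for each choice of margin:

```r
# Hypothetical 2 x 3 matrix of perfect squares
m <- matrix(c(1, 4, 9, 16, 25, 36), nrow = 2)

by_row  <- apply(m, 1, sqrt)    # each row's result is stored as a column
by_col  <- apply(m, 2, sqrt)    # each column's result is stored as a column
by_cell <- apply(m, 1:2, sqrt)  # element by element

dim(by_row)   # 3 2: transposed relative to m
dim(by_col)   # 2 3
dim(by_cell)  # 2 3

# The values are the same; only the orientation differs
all.equal(t(by_row), by_col)
```

With MARGIN = 1 the result comes back transposed, so wrap it in t() if you need the original orientation.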
vec <- c(1:10)
vec
## [1] 1 2 3 4 5 6 7 8 9 10
#apply(vec, 1, sum)# will not work on vectors
If you run this function it will return the error: Error in apply(vec, 1, sum) : dim(X) must have a positive
length. As you can see, this didn’t work because apply expects data with dimensions, such as a matrix or
array, and a plain vector has none. If your data is a vector you need to use lapply, sapply, or vapply instead.
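The following sketch captures that error with tryCatch instead of stopping the script, then shows sapply handling the same vector. The vector is assumed to hold the integers 1 to 10, as the printed output suggests:

```r
vec <- 1:10  # assumed to match the vector used in this chapter

# apply() needs an object with dimensions, so this call fails;
# tryCatch() turns the error into a character string we can inspect
msg <- tryCatch(
  apply(vec, 1, sum),
  error = function(e) conditionMessage(e)
)
print(msg)

# sapply() works directly on the vector
squares <- sapply(vec, function(x) x^2)
print(squares)
```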
#lapply, sapply, and vapply commands
lapply, sapply, and vapply are all functions that will loop a function through data in a list or vector. First,
try looking up lapply in the help section to see a description of all three functions.
?lapply
Here are the arguments for the three functions:
lapply(X, FUN, ...)
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
In this case, X is a vector or list, and FUN is the function you want to use. sapply and vapply have extra
arguments, but most of them have default values, so you don’t need to worry about them. However, vapply
requires another argument called FUN.VALUE, which we will look at later.
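If you are curious what the extra sapply arguments do, this short sketch toggles them one at a time. The character vector words is a hypothetical input chosen only to make the naming behaviour visible:

```r
words <- c("a", "bb", "ccc")  # hypothetical input

# Defaults: result is simplified to a vector, named after the input strings
named <- sapply(words, nchar)

# USE.NAMES = FALSE drops the names
unnamed <- sapply(words, nchar, USE.NAMES = FALSE)

# simplify = FALSE keeps the result as a list, as lapply would
as_list <- sapply(words, nchar, simplify = FALSE)

print(named)
print(unnamed)
print(as_list)
```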
Example 1: Getting started with lapply
Earlier, we created the vector vec. Let’s use that vector to test out the lapply function.
lapply(vec, sum)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 5
##
## [[6]]
## [1] 6
##
## [[7]]
## [1] 7
##
## [[8]]
## [1] 8
##
## [[9]]
## [1] 9
##
## [[10]]
## [1] 10
This function didn’t add up the values like we may have expected it to. This is because lapply treats
the vector like a list and applies the function to each element of the vector separately.
Let’s try using a list instead:
A<-c(1:9)
B<-c(1:12)
C<-c(1:15)
my.lst<-list(A,B,C)
lapply(my.lst, sum)
## [[1]]
## [1] 45
##
## [[2]]
## [1] 78
##
## [[3]]
## [1] 120
This time, the lapply function seemed to work better. The function summed each vector in the list and
returned a list of the 3 sums.
Example 2: sapply
sapply works just like lapply, but will simplify the output if possible. This means that instead of
returning a list like lapply, it will return a vector if the data is simplifiable.
sapply(vec, sum)
## [1] 1 2 3 4 5 6 7 8 9 10
sapply(my.lst, sum)
## [1] 45 78 120
See how these two examples gave the same answers as lapply, but returned a vector instead of a list?
Example 3: vapply
vapply is similar to sapply, but it requires you to specify what type of data you are expecting. The
arguments for vapply are vapply(X, FUN, FUN.VALUE). FUN.VALUE is where you specify the type of data
you are expecting. I am expecting each item in the list to return a single numeric value, so
FUN.VALUE = numeric(1).
vapply(vec, sum, numeric(1))
## [1] 1 2 3 4 5 6 7 8 9 10
vapply(my.lst, sum, numeric(1))
## [1] 45 78 120
If your function were to return more than one numeric value, FUN.VALUE = numeric(1) would cause vapply
to stop with an error. This is useful as a check when you are expecting exactly one result per subject.
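A small sketch of that check in action. It rebuilds a list with the same values as my.lst and uses range, which returns two values, so numeric(1) is the wrong template. The replacement message inside tryCatch is my own wording, not R's:

```r
my.lst <- list(1:9, 1:12, 1:15)  # same values as the list built earlier

# range() returns two values per element, so vapply() refuses numeric(1)
result <- tryCatch(
  vapply(my.lst, range, numeric(1)),
  error = function(e) "vapply stopped: result was not length 1"
)
print(result)

# Declaring the correct shape works: one column per list element
ranges <- vapply(my.lst, range, numeric(2))
print(ranges)
```

Declaring numeric(2) gives a matrix with two rows (minimum and maximum) and one column per list element.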
#Transforming data with sapply
Like apply, these functions can also be used for transforming data inside the list:
my.lst2 <- sapply(my.lst, function(x) x*2)
my.lst2
## [[1]]
## [1] 2 4 6 8 10 12 14 16 18
##
## [[2]]
## [1] 2 4 6 8 10 12 14 16 18 20 22 24
##
## [[3]]
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
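The result above stays a list because the three vectors have different lengths, so there is nothing to simplify to. As a sketch, with a hypothetical list whose elements are all the same length, sapply returns a matrix instead:

```r
# Hypothetical list with equal-length elements
equal.lst <- list(1:3, 4:6)

doubled_s <- sapply(equal.lst, function(x) x * 2)  # simplified to a matrix
doubled_l <- lapply(equal.lst, function(x) x * 2)  # always a list

print(doubled_s)
print(doubled_l)
```

Each list element becomes one column of the matrix.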
Which function should I use, lapply, sapply, or vapply?
If you are trying to decide which of these three functions to use, I would suggest sapply where possible,
because it is the simplest. If you do not want your results to be simplified to a vector, use lapply. If you
want to specify the type of result you are expecting, use vapply.
*******************************************************************************
#tapply
Sometimes you may want to apply a function to some data, but have it separated by factor. In
that case, you should use tapply. Let’s take a look at the information for tapply.
?tapply
The arguments for tapply are tapply(X, INDEX, FUN). The only new argument is INDEX, which is
the factor you want to use to separate the data.
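As a minimal sketch of those three arguments, here is tapply on made-up data. The names scores and condition are hypothetical:

```r
# Hypothetical measurements and the condition each one belongs to
scores    <- c(10, 20, 30, 40, 50, 60)
condition <- factor(c("a", "b", "a", "b", "a", "b"))

# X = scores, INDEX = condition, FUN = mean
group_means <- tapply(scores, condition, mean)
print(group_means)
```

The result has one entry per level of the factor, named after that level.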
Example: For the matrix X, you might want to sum groups of rows within columns, and rowsum (singular
and all lower case, in contrast to rowSums, above) is a very efficient function for this. In this example, we
want to group together row 1 and row 4 (as group A) and row 2 and row 3 (group B). Note that the grouping
vector has to have length equal to the number of rows:
group=c("A","B","B","A")
#rowsum(X, group)
## 1 2 3 4 5 6
## A 6 1 2 3 3 3.1
## B 2 6 2 0 2 3.4
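Because X is random, here is a sketch of the same grouping on a small fixed matrix so the sums can be checked by hand. The matrix m and its column names are hypothetical:

```r
# Hypothetical 4 x 2 matrix with fixed values
m <- matrix(c(1, 2, 4, 8,
              10, 20, 40, 80),
            nrow = 4,
            dimnames = list(paste0("Tr", 1:4), c("c1", "c2")))

# Rows 1 and 4 form group A; rows 2 and 3 form group B
group <- c("A", "B", "B", "A")

grouped <- rowsum(m, group)
print(grouped)

# The same result for a single column, using tapply
c1_sums <- tapply(m[, "c1"], group, sum)
print(c1_sums)
```

rowsum handles every column in one call, whereas tapply works on one column at a time.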
Example 2: Combining functions
You can use tapply to do some quick summary statistics on a variable split by condition. In this
example, I created a function that returns a vector of both the mean and standard deviation. You can
create a function like this for any apply function, not just tapply.
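The original code for this example is not shown here, so the following is only a sketch of the idea under assumed data. The names scores, condition, and mean_sd are hypothetical:

```r
# Hypothetical scores for two conditions
scores    <- c(2, 4, 6, 10, 20, 30)
condition <- factor(rep(c("control", "treated"), each = 3))

# Helper that returns both summary statistics as a named vector
mean_sd <- function(x) c(mean = mean(x), sd = sd(x))

# Because mean_sd returns two values, tapply gives one vector per condition
stats <- tapply(scores, condition, mean_sd)
print(stats)

# Pull out the summary for a single condition
stats[["control"]]
```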
mapply
The last apply function I will cover is mapply.
?mapply
The arguments for mapply are mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE).
First you list the function, followed by the vectors you are using. The rest of the arguments have default
values, so they don’t need to be changed for now. When you have a function that takes 2 arguments, the
first vector goes into the first argument and the second vector goes into the second argument.
Example 1: Understanding mapply
In this example, 1:9 is specifying the value to repeat, and 9:1 is specifying how many times to repeat.
This order is based on the order of arguments in the rep function itself.
mapply(rep, 1:9, 9:1)
## [[1]]
## [1] 1 1 1 1 1 1 1 1 1
##
## [[2]]
## [1] 2 2 2 2 2 2 2 2
##
## [[3]]
## [1] 3 3 3 3 3 3 3
##
## [[4]]
## [1] 4 4 4 4 4 4
##
## [[5]]
## [1] 5 5 5 5 5
##
## [[6]]
## [1] 6 6 6 6
##
## [[7]]
## [1] 7 7 7
##
## [[8]]
## [1] 8 8
##
## [[9]]
## [1] 9
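As a further sketch, mapply can also pair up vectors for a function you write yourself, and MoreArgs passes a value that stays the same on every call. The helper add_scaled is hypothetical:

```r
# Hypothetical helper: add a pair of values, then multiply by a fixed scale
add_scaled <- function(x, y, scale) (x + y) * scale

# x comes from 1:3, y comes from 4:6, and scale is 10 for every pair
res <- mapply(add_scaled, 1:3, 4:6, MoreArgs = list(scale = 10))
print(res)
```

Because every call returns a single number, the result simplifies to a vector rather than a list.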
Example 2: Creating a new variable Another use for mapply would be to create a new variable.
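The code for this example is not shown here, so what follows is a sketch of one way it could look, not the original. The data frame df, its columns, and the bmi helper are all hypothetical:

```r
# Hypothetical data: heights in cm and weights in kg
df <- data.frame(
  height_cm = c(150, 160, 170),
  weight_kg = c(45, 64, 72.25)
)

# Hypothetical helper taking two arguments per row
bmi <- function(weight, height) weight / (height / 100)^2

# mapply walks down both columns together and fills the new column
df$bmi <- mapply(bmi, df$weight_kg, df$height_cm)
print(df)
```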
Example 3: Saving data into a premade vector When using an apply family function to create a new
variable, one option is to create a new vector ahead of time with the size of the vector pre-allocated.
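Again the original code is not shown, so this is a sketch of the pre-allocation idea under assumed inputs. The vectors x and y and the result vector out are hypothetical:

```r
# Hypothetical inputs
x <- 1:5
y <- 5:1

# Create the result vector ahead of time, with its final length
out <- numeric(length(x))

# out[] <- ... fills the existing vector rather than replacing it
out[] <- mapply(function(a, b) a * b, x, y)
print(out)
```

Assigning with out[] keeps the length and type chosen up front.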
********************************************************************************
#Using apply functions on real datasets
This last section will be a few examples of using apply functions on real data. This section loads the MASS
package, which includes a collection of publicly available datasets. Please install MASS if you do not already
have it; to do so, you can uncomment the code below. Note that state.x77, the dataset used in the examples
here, ships with R’s built-in datasets package.
#install.packages("MASS")
library(MASS)
Let’s look at the data we will be using. We will be using the state.x77 dataset:
head(state.x77)
##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766
str(state.x77)
##  num [1:50, 1:8] 3615 365 2212 2110 21198 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
All the data in the dataset happens to be numeric, which is necessary when the function inside the apply
function requires numeric data.
Example 1: using apply to get summary data
You can use apply to find measures of central tendency and dispersion:
apply(state.x77, 2, mean)
state.range <- apply(state.x77, 2, function(x) c(min(x), median(x), max(x)))
state.range
## $Northeast
## [1] 472 3100 18076
##
## $South
## [1] 579.0 3710.5 12237.0
##
## $`North Central`
## [1] 637 4255 11197
##
## $West
## [1] 365 1144 21198
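The per-region output printed above has the shape tapply returns when the function gives several values per group. The sketch below should reproduce it, assuming that it shows population split by state.region, the region factor in R's datasets package. That is an inference from the printed values, not something the text states:

```r
# state.x77 and state.region both come from R's built-in datasets package.
# Assumption: the printed output is min, median and max population by region.
pop.by.region <- tapply(
  state.x77[, "Population"],
  state.region,
  function(x) c(min(x), median(x), max(x))
)
pop.by.region

# Pull out the three values for a single region
pop.by.region[["West"]]
```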