
Flow Control and Functions

Presidency University

November, 2024
User Defined Functions

The basic format of the code is

function_name <- function(arguments)
{
main computation to be done
}

testfunction <- function(x,y) #----define a function
{
x*y
}

testfunction(2,5) #-----call the function with the arguments 2 and 5

[1] 10
Doing more than one computation

- When a function performs more than one computation and produces several results, return() can be used to return all of them together as a vector:

testfunction <-function(x,y)
{
prod=x*y
su= x+y
return(c(prod,su))
}

testfunction(2,5)

[1] 10 7

- Note that the two outputs can be accessed separately by index:

result=testfunction(2,5)
result[1]

[1] 10

result[2]

[1] 7

- Alternatively, multiple outputs can be returned as a list using list(). This lets us extract them by name as well as by index.

testfunction <- function(x,y)
{
prod=x*y
su= x+y
output=list(prod,su) #--- Creates the list
names(output)=c("Product", "Sum") #--- name them (optional)
return(output) #---- returns the list
}

result=testfunction(2,5)
result

$Product
[1] 10

$Sum
[1] 7

result$Product #----result[[1]] gives the same output

[1] 10

result$Sum #---- result[[2]] gives the same output

[1] 7
Default argument of a function
- R lets us specify default values for the arguments when defining a function. These defaults are used whenever the function is called without supplying those arguments.
testfunction <- function(x=1,y=1)
{
prod=x*y
su= x+y
output=list(prod,su) #--- Creates the list
names(output)=c("Product", "Sum") #--- name them (optional)
return(output) #---- returns the list
}
testfunction() #--call with no argument

$Product
[1] 1

$Sum
[1] 2

testfunction(x=4)

$Product
[1] 4

$Sum
[1] 5
Additional Arguments
- Additional arguments (typically optional ones that cannot be fixed in advance) can be accommodated using ...

testfunction <- function(x=1,y=1,...)
{
prod=x*y
su= x+y
output=list(prod,su) #--- Creates the list
names(output)=c("Product", "Sum") #--- name them (optional)
return(output) #---- returns the list
}

#--z is an extra argument which is not used here
testfunction(2,5,z=12)

$Product
[1] 10

$Sum
[1] 7
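
The extra arguments captured by ... become useful once they are forwarded to another function. A minimal sketch, assuming a hypothetical helper flexible_mean() (not part of the original slides) that simply passes ... on to mean():

```r
# Hypothetical helper: forwards any extra arguments to mean() through ...
flexible_mean <- function(x, ...) {
  mean(x, ...)
}

v <- c(1, 2, NA, 100)
flexible_mean(v)                # NA, because of the missing value
flexible_mean(v, na.rm = TRUE)  # mean of 1, 2 and 100 after dropping the NA
```

Here na.rm is never named in the definition of flexible_mean(); it reaches mean() only through the dots.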
Data types of arguments
- Since the types of the arguments are not declared when the function is defined, the arguments can be of any data type, provided the code inside the function can operate on that type.

testfunction <- function(x=1,y=1,...)
{
prod=x*y
su= x+y
output=list(prod,su) #--- Creates the list
names(output)=c("Product", "Sum") #--- name them (optional)
return(output) #---- returns the list
}

testfunction(c(1,2),c(3,4)) #--calling with vectors

$Product
[1] 3 8

$Sum
[1] 4 6

testfunction("F","M") #---calling with characters

Error in x * y: non-numeric argument to binary operator


Sanity checking argument

- So how can we stop a function when the user calls it with unsuitable arguments? A good practice is to write the function so that it checks whether the supplied arguments make sense before running its main body.

testfunction= function(x,y)
{
#---check if the arguments are not characters
stopifnot( typeof(x)!="character", typeof(y)!="character" )
prod=x*y
su= x+y
output=list(prod,su) #--- Creates the list
names(output)=c("Product", "Sum") #--- name them (optional)
return(output)
}
testfunction("F","M")

Error in testfunction("F", "M"): typeof(x) != "character" is not TRUE

- The stopifnot() function halts execution (with an error message) unless every one of its arguments evaluates to TRUE.
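
When a more descriptive error message is wanted, the same check can be written with is.numeric() and stop(). The sketch below is only illustrative; the name checked_function() and the wording of the message are assumptions, not part of the original slides:

```r
# Hypothetical variant of testfunction() with an explicit check and message
checked_function <- function(x, y) {
  if (!is.numeric(x) || !is.numeric(y)) {
    stop("both x and y must be numeric")
  }
  list(Product = x * y, Sum = x + y)
}

checked_function(2, 5)$Product   # 10

# tryCatch() captures the error message instead of halting the script
msg <- tryCatch(checked_function("F", "M"),
                error = function(e) conditionMessage(e))
msg
```

Checking for the type that is required (numeric) tends to be safer than excluding one type that is known to fail (character), since other unsuitable types such as lists are then rejected as well.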
Applications: Curve fitting
- Any function can be plotted using curve(expr, from, to, n, add=TRUE/FALSE, ...), where from and to give the range over which the function is plotted and n (an integer) is the number of points at which it is evaluated. add=TRUE/FALSE indicates whether or not to add the curve to an existing plot.

myfun= function(x) x*(1-x) #----braces are not required for a single-line body
curve(myfun, from= 0, to=1) #-other arguments take default values

[Figure: plot of myfun(x) against x for x from 0 to 1]


Example: Plotting normal curve

curve(dnorm,from=-4,to=4,n=500) #---dnorm gives pdf of N(0,1)


[Figure: plot of dnorm(x) against x for x from -4 to 4]
Example: Does $\lim_{x \to 0} \sin(1/x)$ exist?

curve( sin(1/x), from=-2, to=2)

Warning in sin(1/x): NaNs produced


[Figure: plot of sin(1/x) against x for x from -2 to 2]
Example (Contd.): Zoom at the origin

curve( sin(1/x), from=-0.1, to=0.1)

Warning in sin(1/x): NaNs produced


[Figure: plot of sin(1/x) against x for x from -0.1 to 0.1]
Applications: Solving Equation
- For equations in one variable we can use uniroot(function, interval, ...)
- To solve $e^x = \sin(x)$ we write

uniroot( function(x) exp(x)-sin(x), c(-5,5))

$root
[1] -3.183063

$f.root
[1] -1.359327e-08

$iter
[1] 8

$init.it
[1] NA

$estim.prec
[1] 6.103516e-05

#-note how f(x) is written in one line as an anonymous function

- To find the real or complex roots of a polynomial use polyroot(); to solve a system of n non-linear equations we can use multiroot() from the package rootSolve.
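
As a small illustration of polyroot(), the sketch below finds the roots of x^2 - 5x + 6 (a polynomial chosen here only as an example). Note that polyroot() expects the coefficients in increasing order of power and returns complex numbers:

```r
# Coefficients are given in increasing order of power: 6 - 5x + x^2
roots <- polyroot(c(6, -5, 1))
roots            # complex numbers; the imaginary parts are numerically zero
sort(Re(roots))  # real parts of the roots, which are 2 and 3 up to rounding
```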
Applications: Calculus

- Definite integrals can be computed using integrate(), e.g. $\int_0^1 x\,dx$ can be computed using

integrate (function(x) x, 0, 1)

0.5 with absolute error < 5.6e-15

integrate (dnorm, -3, 3)

0.9973002 with absolute error < 9.3e-07

- Expressions for derivatives can be obtained using deriv()
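
A brief sketch of symbolic differentiation in base R follows. D() returns the derivative as an expression, while deriv() with function.arg = TRUE returns a function whose result carries the derivative in a "gradient" attribute. The functions sin(x) and x^2 are chosen here purely for illustration:

```r
# D() returns the derivative as an unevaluated expression
dsin <- D(quote(sin(x)), "x")
dsin                       # cos(x)
eval(dsin, list(x = 0))    # value of the derivative at x = 0

# deriv() can return a function that evaluates both value and derivative
g <- deriv(~ x^2, "x", function.arg = TRUE)
val <- g(3)
as.numeric(val)            # the function value at x = 3
attr(val, "gradient")      # the derivative 2*x evaluated at x = 3
```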


Applications: Optimization

- The maximum or minimum value of a function over an interval can be found using

optimize(function, interval, maximum=TRUE/FALSE)

optimize(function(x) exp(-x),c(0,5))

$minimum
[1] 4.999936

$objective
[1] 0.006738379

- There are other functions for optimization, such as optim(), nlm() and constrOptim().
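
In the exp(-x) example the function is decreasing, so the reported minimum sits at the right end of the search interval. For an interior optimum, here is a small sketch with maximum=TRUE, using the function x(1-x) from the curve-plotting example (its maximum is at x = 0.5):

```r
myfun <- function(x) x * (1 - x)

# Search for the maximum over the interval [0, 1]
res <- optimize(myfun, interval = c(0, 1), maximum = TRUE)
res$maximum    # location of the maximum, close to 0.5
res$objective  # value of the function there, close to 0.25
```

With maximum=TRUE the first component of the result is named maximum rather than minimum.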
Loops in R
- Loops help us repeat a task. We start with the for loop.

- The syntax is
for (variable in sequence)
{
expression to be evaluated
}

- Here sequence is an expression that evaluates to a vector (its elements need not be in arithmetic progression).

- For example, all of the following are valid:

for (i in 1:10)
for (i in c(2,3,7,9,13,17,19,23))
for (i in c("A", "B", "C"))

- The number of times the expression in the loop is evaluated equals the length of the sequence.
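
A minimal sketch of a complete for loop, assuming an arbitrary illustrative vector named values; the loop body runs once per element:

```r
values <- c(2, 3, 7, 9)
total <- 0
for (v in values) {
  total <- total + v^2   # add the square of the current element
}
total   # 4 + 9 + 49 + 81 = 143
```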
While Loop

- The syntax is
while (condition)
{
expression to be evaluated
}

- The loop keeps repeating its action as long as the test condition is satisfied, and stops as soon as the condition fails.

- Unlike a for loop, we need not know in advance how many times the loop will repeat.
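
As an illustration, the sketch below (with made-up starting value 100) halves a number until it is no longer greater than 1 and counts how many halvings were needed; the number of repetitions is not specified anywhere in advance:

```r
n <- 100
steps <- 0
while (n > 1) {
  n <- n / 2           # halve the current value
  steps <- steps + 1   # count the repetition
}
steps   # 7 halvings: 100 -> 50 -> 25 -> ... -> 0.78125
n
```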
If and If-Else

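- The syntax for the if statement is

if (condition)
{
expression
}

- For a binary situation we can use if-else

if (condition)
{
expression 1
} else {
expression 2
}

- At the top level (outside a function or other braces), else must appear on the same line as the closing } of the if block; otherwise R reports a syntax error.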
If-Else function

- A vectorized alternative to if-else statements is the ifelse() function.

- The syntax is
new variable = ifelse(condition, value of new variable if condition is true, value if condition is false)

- e.g. category = ifelse(marks > 80, "Good", "Fair") assigns the value Good if marks is more than 80 and Fair otherwise.

- An additional advantage is that the condition can compare a vector with a scalar (each element is compared with the scalar), so a whole vector is processed in one call.
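
A short sketch of the vectorized behaviour, with made-up marks:

```r
marks <- c(95, 60, 81, 80)

# Each element of marks is compared with 80 and labelled accordingly
category <- ifelse(marks > 80, "Good", "Fair")
category   # "Good" "Fair" "Good" "Fair"
```

A plain if statement would use only a single condition value here, which is why ifelse() is the natural choice when the condition is a vector.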
Else if Ladder

- When we have more than two cases we can use an else-if ladder

f=function(x)
{
if (x==1) print("a")
else if (x==2) print("b")
else print("c")
}
Switch Statement

- An alternative and often faster way is the switch() function.

- The basic syntax is switch(statement, list)

- Here statement is evaluated and, based on its value, the corresponding item in the list is returned.

- e.g. switch(2, "A", "B", "C") gives the answer "B". It selects item no. 2 from the list.

- switch(4, "A", "B", "C") gives NULL, as there is no item with index 4 in the list.

- switch("color", "color"="red", "shape"="round", "length"=5) gives the answer "red" (it matches the string).
Example

stat= function(x, type)
{
switch(type, "mean"=mean(x), "median"=median(x), "sd"=sd(x))
} #----function ends here
stat(1:10, "mean") #call the function with mean

[1] 5.5

stat(1:9, "median") #call the function with median

[1] 5
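
With a character first argument, an unnamed final argument to switch() acts as the default when nothing matches. A hedged sketch building on stat(), where the name stat_safe() and the error text are illustrative assumptions:

```r
# Hypothetical variant of stat() that fails loudly on an unknown type
stat_safe <- function(x, type) {
  switch(type,
         mean = mean(x),
         median = median(x),
         sd = sd(x),
         stop("unknown type: ", type))   # default branch
}

stat_safe(1:10, "mean")   # 5.5

# An unsupported type now raises an error instead of silently returning NULL
msg <- tryCatch(stat_safe(1:10, "mode"),
                error = function(e) conditionMessage(e))
msg
```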
Repeat Loop

- Basic syntax is
repeat
{
expression to be evaluated
}

- There is no default way of termination.

- We need to terminate the loop manually using the break statement.

x=1 #---start with x equal to 1
repeat
{ #--Loop begins here
print(x)
x=x+1
if (x==6) break #--manual instruction to exit loop
} #---Loop ends here
x
Example: Fitting a Model

- Bigger cities tend to have higher economic output per capita. One proposed statistical model is

$$Y = y_0 N^a + \epsilon$$

where $Y$ is the per-capita "gross metropolitan product" of a city, $N$ is its population, $y_0$ and $a$ are parameters, and $\epsilon$ is the random error.
gmp <- read.table("gmp.dat")
gmp$pop <- gmp$gmp/gmp$pcgmp
plot(pcgmp~pop, data=gmp, log="x", xlab="Population",
ylab="Per-Capita Economic Output ($/person-year)",
main="US Metropolitan Areas, 2006")
curve(6611*x^(1/8),add=TRUE,col="blue")
[Figure: "US Metropolitan Areas, 2006". Per-Capita Economic Output ($/person-year) plotted against Population (log scale), with the curve 6611*x^(1/8) in blue]
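- Suppose we choose $y_0 = 6611$. We want to fit the model

$$Y = y_0 N^a + \epsilon$$

by minimizing $MSE(a) = \frac{1}{n}\sum_{i=1}^{n} (Y_i - y_0 N_i^a)^2$ with respect to $a$.

- But how do we take the derivative with respect to $a$?

- Compute it numerically using

$$MSE'(a) \approx \frac{MSE(a + h) - MSE(a)}{h}$$

and update $a$ iteratively with

$$a_{t+1} - a_t \propto -MSE'(a_t)$$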
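
To make the two formulas concrete, here is a small sketch on a function whose derivative is known exactly. The function f(a) = a^2, the step h = 0.001 and the proportionality constant 0.1 are illustrative assumptions, not values from the slides:

```r
f <- function(a) a^2   # toy objective; its exact derivative is 2*a

# Forward-difference approximation to the derivative
forward_diff <- function(f, a, h = 1e-3) {
  (f(a + h) - f(a)) / h
}

d <- forward_diff(f, 3)
d                      # about 6.001; the exact derivative at a = 3 is 6

# One update step: move a against the sign of the derivative
a_new <- 3 - 0.1 * d
a_new                  # smaller than 3, i.e. closer to the minimizer a = 0
```

The small excess over 6 is the error of the forward difference, which shrinks in proportion to h.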
First Attempt
maximum.iterations <- 100
deriv.step <- 1/1000
step.scale <- 1e-12
stopping.deriv <- 1/100
iteration <- 0
deriv <- Inf
a <- 0.15
while ((iteration < maximum.iterations) && (deriv > stopping.deriv)) {
iteration <- iteration + 1
mse.1 <- mean((gmp$pcgmp - 6611*gmp$pop^a)^2)
mse.2 <- mean((gmp$pcgmp - 6611*gmp$pop^(a+deriv.step))^2)
deriv <- (mse.2 - mse.1)/deriv.step
a <- a - step.scale*deriv
}
list(a=a,iterations=iteration,converged=(iteration < maximum.iterations))

$a
[1] 0.1258166

$iterations
[1] 58

$converged
[1] TRUE
What’s wrong with this?

- Not encapsulated: re-running means cutting and pasting code, but how much of it? Also, hard to make part of something larger
- Inflexible: to change the initial guess at a, we have to edit, cut, paste, and re-run
- Error-prone: to change the data set, we have to edit, cut, paste, re-run, and hope that all the edits are consistent
- Hard to fix: it should stop when the absolute value of the derivative is small, but this stops when the derivative is large and negative. Imagine having five copies of this and needing to fix the same bug in each.

- We will turn this into a function and then improve it

Second Attempt

estimate.scaling.exponent.1 <- function(a) {
maximum.iterations <- 100
deriv.step <- 1/1000
step.scale <- 1e-12
stopping.deriv <- 1/100
iteration <- 0
deriv <- Inf
while ((iteration < maximum.iterations) && (abs(deriv) > stopping.deriv)) {
iteration <- iteration + 1
mse.1 <- mean((gmp$pcgmp - 6611*gmp$pop^a)^2)
mse.2 <- mean((gmp$pcgmp - 6611*gmp$pop^(a+deriv.step))^2)
deriv <- (mse.2 - mse.1)/deriv.step
a <- a - step.scale*deriv
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
- Problem: All those magic numbers!
- Solution: Make them defaults
Third Attempt

estimate.scaling.exponent.2 <- function(a, y0=6611,
maximum.iterations=100, deriv.step = .001,
step.scale = 1e-12, stopping.deriv = .01) {
iteration <- 0
deriv <- Inf
while ((iteration < maximum.iterations) && (abs(deriv) > stopping.deriv)) {
iteration <- iteration + 1
mse.1 <- mean((gmp$pcgmp - y0*gmp$pop^a)^2)
mse.2 <- mean((gmp$pcgmp - y0*gmp$pop^(a+deriv.step))^2)
deriv <- (mse.2 - mse.1)/deriv.step
a <- a - step.scale*deriv
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
- Problem: Why type out the same calculation of the MSE twice?
- Solution: Declare a function
Fourth Attempt

estimate.scaling.exponent.3 <- function(a, y0=6611,
maximum.iterations=100, deriv.step = .001,
step.scale = 1e-12, stopping.deriv = .01) {
iteration <- 0
deriv <- Inf
mse <- function(a) { mean((gmp$pcgmp - y0*gmp$pop^a)^2) }
while ((iteration < maximum.iterations) && (abs(deriv) > stopping.deriv)) {
iteration <- iteration + 1
deriv <- (mse(a+deriv.step) - mse(a))/deriv.step
a <- a - step.scale*deriv
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
- Problem: Locked in to using specific columns of gmp; we shouldn’t have to re-write the function just to compare two data sets
- Solution: More arguments, with defaults
Fifth Attempt

estimate.scaling.exponent.4 <- function(a, y0=6611,
response=gmp$pcgmp, predictor = gmp$pop,
maximum.iterations=100, deriv.step = .001,
step.scale = 1e-12, stopping.deriv = .01) {
iteration <- 0
deriv <- Inf
mse <- function(a) { mean((response - y0*predictor^a)^2) }
while ((iteration < maximum.iterations) && (abs(deriv) > stopping.deriv)) {
iteration <- iteration + 1
deriv <- (mse(a+deriv.step) - mse(a))/deriv.step
a <- a - step.scale*deriv
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
- Respecting the interfaces: We could turn the while() loop into a for() loop, and nothing outside the function would care
estimate.scaling.exponent.5 <- function(a, y0=6611,
response=gmp$pcgmp, predictor = gmp$pop,
maximum.iterations=100, deriv.step = .001,
step.scale = 1e-12, stopping.deriv = .01) {
mse <- function(a) { mean((response - y0*predictor^a)^2) }
for (iteration in 1:maximum.iterations) {
deriv <- (mse(a+deriv.step) - mse(a))/deriv.step
a <- a - step.scale*deriv
if (abs(deriv) <= stopping.deriv) { break }
}
fit <- list(a=a,iterations=iteration,
converged=(iteration < maximum.iterations))
return(fit)
}
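
Because the data are now arguments, the same estimation logic can be tried on any data set. The sketch below is a self-contained variant under stated assumptions: the name fit.exponent(), the made-up noise-free data (true exponent 0.5, y0 = 1) and the tuning values step.scale = 0.05, deriv.step = 1e-6 and stopping.deriv = 1e-4 are all chosen for this small example and differ from the gmp settings above.

```r
# Hypothetical, self-contained variant of the for() version:
# the data are passed in explicitly instead of defaulting to columns of gmp
fit.exponent <- function(a, y0, response, predictor,
                         maximum.iterations = 100, deriv.step = 1e-6,
                         step.scale = 0.05, stopping.deriv = 1e-4) {
  mse <- function(a) { mean((response - y0*predictor^a)^2) }
  for (iteration in 1:maximum.iterations) {
    deriv <- (mse(a + deriv.step) - mse(a))/deriv.step
    a <- a - step.scale*deriv
    if (abs(deriv) <= stopping.deriv) { break }
  }
  list(a = a, iterations = iteration,
       converged = (iteration < maximum.iterations))
}

# Made-up, noise-free data generated with a true exponent of 0.5
N <- 1:5
Y <- 1 * N^0.5

fit <- fit.exponent(a = 0.3, y0 = 1, response = Y, predictor = N)
fit$a          # should be close to the true exponent 0.5
fit$converged
```

The step scale has to suit the data: the value 1e-12 used with gmp reflects the very large derivatives produced by populations in the millions, whereas this toy data set needs a much larger step.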
Avoid using loops in R

- In R it is generally recommended to avoid explicit for() loops as a tool for iteration where possible. Instead, iterative work can be done in the following ways:
- Indexing with logical conditions, and vectorization

x[x>2]
sum(x*y)

- Using the apply family of functions: R offers a family of apply functions, which allow you to apply a function across different chunks of data. This offers an alternative to explicit iteration using a for() loop. It can be simpler and faster, though not always.
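
A small side-by-side sketch, with made-up vectors x and y, of an explicit loop and its vectorized equivalent:

```r
x <- c(1, 5, 2, 8)
y <- c(2, 2, 2, 2)

# Explicit loop: accumulate the sum of elementwise products
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i] * y[i]
}
total

# Vectorized equivalents
sum(x * y)   # same value as total
x[x > 2]     # elements of x larger than 2
```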
Apply family

- A quick overview of these functions is as follows:
  - apply(): apply a function to rows or columns of a matrix or data frame
  - lapply(): apply a function to elements of a list or vector
  - sapply(): same as the above, but simplify the output (if possible)
  - tapply(): apply a function to subsets of a vector defined by the levels of a factor
Using apply()

- The apply() function takes inputs of the following form:
  - apply(x, MARGIN=1, FUN=my.fun), to apply my.fun() across rows of a matrix or data frame x.
  - apply(x, MARGIN=2, FUN=my.fun), to apply my.fun() across columns of a matrix or data frame x.
- Let us find the minimum entry in each column of the airquality dataset in R

mydata=na.omit(airquality)
apply(mydata, MARGIN=2, FUN=min)

Ozone Solar.R Wind Temp Month Day
1.0 7.0 2.3 57.0 5.0 1.0
- Suppose we need to find which observation attains the maximum of each variable.

apply(mydata, MARGIN=2, FUN=which.max)

Ozone Solar.R Wind Temp Month Day
77 12 30 79 83 24

- In fact, this technique is particularly useful for obtaining a summary of each variable.

apply(mydata, MARGIN=2, FUN=summary)

Ozone Solar.R Wind Temp Month Day
Min. 1.0000 7.0000 2.30000 57.00000 5.000000 1.00000
1st Qu. 18.0000 113.5000 7.40000 71.00000 6.000000 9.00000
Median 31.0000 207.0000 9.70000 79.00000 7.000000 16.00000
Mean 42.0991 184.8018 9.93964 77.79279 7.216216 15.94595
3rd Qu. 62.0000 255.5000 11.50000 84.50000 9.000000 22.50000
Max. 168.0000 334.0000 20.70000 97.00000 9.000000 31.00000
- We can also apply a user-defined function over the rows or columns, but then the function has to be defined explicitly beforehand.
- (Example continued) Suppose we write a function which computes the 10% symmetric trimmed mean and then apply it as above.

trimmed_mean = function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
}

apply(mydata, MARGIN=2, FUN=trimmed_mean)

Ozone Solar.R Wind Temp Month Day
37.177778 189.764045 9.927957 78.000000 7.216216 15.532609

- Sometimes it is more convenient to define the function "on the fly" instead of defining it beforehand.
- Alternatively, we can define our trimmed mean function directly inside the call to apply().

apply(state.x77, MARGIN=2, FUN=function(v) {
q1 = quantile(v, prob=0.1)
q2 = quantile(v, prob=0.9)
return(mean(v[q1 <= v & v <= q2]))
})

Population Income Illiteracy Life Exp Murder HS Grad
3384.27500 4430.07500 1.07381 70.91775 7.29750 53.33750
Frost Area
104.68293 56575.72500

- Suppose the user-defined function needs some extra arguments. These can be passed to the function through apply(). More specifically, we can use apply(x, MARGIN=1, FUN=my.fun, extra.arg.1, extra.arg.2) to pass two extra arguments extra.arg.1 and extra.arg.2 to my.fun().
- We can extend our trimmed mean function to let the user specify the trimming percentiles.

# Our custom function: trimmed mean, with user-specified percentiles
trimmed.mean = function(v, p1, p2) {
q1 = quantile(v, prob=p1)
q2 = quantile(v, prob=p2)
return(mean(v[q1 <= v & v <= q2]))
}

apply(state.x77, MARGIN=2, FUN=trimmed.mean, p1=0.01, p2=0.99)

Population Income Illiteracy Life Exp Murder HS Grad
3974.125000 4424.520833 1.136735 70.882708 7.341667 53.131250
Frost Area
104.895833 61860.687500
What’s the return value?

- What kind of data type will apply() give us? It depends on the function we pass. Suppose we have FUN=my.fun, then:
  - if my.fun() returns a single value, then apply() will return a vector.
  - if my.fun() returns k values, then apply() will return a matrix with k rows (note: this is true regardless of whether MARGIN=1 or MARGIN=2).
  - if my.fun() returns outputs of different lengths for different inputs, then apply() will return a list.
  - if my.fun() returns a list, then apply() will return a list.
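
These rules can be checked on a small matrix. The sketch below uses a made-up 2 x 3 matrix m and base functions only:

```r
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)

r1 <- apply(m, MARGIN = 1, FUN = sum)     # one value per row: a vector
r2 <- apply(m, MARGIN = 1, FUN = range)   # two values per row: a matrix with 2 rows
r3 <- apply(m, MARGIN = 2, FUN = range)   # two values per column: still 2 rows
r4 <- apply(m, MARGIN = 1, FUN = function(v) v[v > 3])  # lengths differ: a list

r1          # 9 12
dim(r2)     # 2 2
dim(r3)     # 2 3
class(r4)   # "list"
```

Note that with MARGIN=1 the results for the rows of m end up as columns of r2, which is a common source of confusion.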
A word of caution

- The apply family is useful most of the time, but we should not overuse the apply paradigm! There are many functions that are optimized for specific tasks and are both simpler and faster than using apply().

- For example
  - rowSums(), colSums(): for computing row, column sums of a matrix
  - rowMeans(), colMeans(): for computing row, column means of a matrix
  - max.col(): for finding the position of the maximum in each row of a matrix
- Combining these functions with logical indexing and vectorized operations will enable you to do quite a lot.
- E.g., how do we count the number of positive entries in each row of a matrix?

x = matrix(rnorm(9), 3, 3)
# Don't do this (much slower for big matrices)
apply(x, MARGIN=1, function(v) { return(sum(v > 0)) })

[1] 2 1 0

# Do this instead (much faster, simpler)
rowSums(x > 0)

[1] 2 1 0
Using lapply()
The lapply() function takes inputs as in: lapply(x, FUN=my.fun), to apply my.fun() across the elements of a list or vector x. The output is always a list.
Consider the following
x=2:5
lapply(x, FUN=log) #same values as log(x), but returned as a list

[[1]]
[1] 0.6931472

[[2]]
[1] 1.098612

[[3]]
[1] 1.386294

[[4]]
[1] 1.609438
- Let us prepare a list and apply the mean function to every element of the list

my.list=list(nums=c(0.1,0.2,0.3),chars=c("a", "b", "c"),bools=c(FALSE,TRUE, FALSE))
lapply(my.list, FUN=mean) # Get a warning: mean() can't be applied to chars

Warning in mean.default(X[[i]], ...): argument is not numeric or logical: returning NA

$nums
[1] 0.2

$chars
[1] NA

$bools
[1] 0.3333333

lapply(my.list, FUN=summary)

$nums
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.10 0.15 0.20 0.20 0.25 0.30

$chars
Length Class Mode
3 character character

$bools
Mode FALSE TRUE
logical 2 1
Using sapply()
The sapply() function works just like lapply(), but tries to simplify the return value whenever possible. The most common simplification is the conversion from a list to a vector.
Let us use sapply() in the previous example
sapply(my.list, FUN=mean) # Simplifies the result, now a vector

Warning in mean.default(X[[i]], ...): argument is not numeric or logical: returning NA

nums chars bools
0.2000000 NA 0.3333333

sapply(my.list, FUN=summary) # Can't simplify, so still a list

$nums
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.10 0.15 0.20 0.20 0.25 0.30

$chars
Length Class Mode
3 character character

$bools
Mode FALSE TRUE
logical 2 1
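
When every call returns a vector of the same length greater than one, sapply() simplifies to a matrix rather than a vector. A brief sketch with a made-up list num.list:

```r
num.list <- list(a = 1:3, b = 4:6)

# range() returns two values (min and max) for each element
res <- sapply(num.list, FUN = range)
res        # a 2 x 2 matrix with one column per list element
dim(res)
```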
Using tapply()

The function tapply() takes inputs as in: tapply(x, INDEX=my.index, FUN=my.fun), to apply my.fun() to subsets of entries in x that share a common level in my.index.
Suppose we want to compute the mean and sd of Frost for each region.
tapply(state.x77[,"Frost"], INDEX=state.region, FUN=mean)

Northeast South North Central West
132.7778 64.6250 138.8333 102.1538

tapply(state.x77[,"Frost"], INDEX=state.region, FUN=sd)

Northeast South North Central West
30.89408 31.30682 23.89307 68.87652
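
The grouping done by tapply() is easier to see on a tiny made-up example, where heights and group are hypothetical vectors:

```r
heights <- c(150, 160, 170, 180)
group <- factor(c("a", "b", "a", "b"))

# Mean of heights within each level of group
res <- tapply(heights, INDEX = group, FUN = mean)
res   # a: mean of 150 and 170; b: mean of 160 and 180
```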
Using split()

The function split() splits up the rows of a data frame by the levels of a factor, as in: split(x, f=my.index) to split a data frame x according to the levels of my.index. Suppose we want to split up the state.x77 dataset according to region.
state.by.reg = split(data.frame(state.x77), f=state.region)
class(state.by.reg) # The result is a list

[1] "list"

names(state.by.reg) # This has 4 elements for the 4 regions

[1] "Northeast" "South" "North Central" "West"

class(state.by.reg[[1]]) # Each element is a data frame

[1] "data.frame"
# For each region, display the first 3 rows of the data frame
lapply(state.by.reg, FUN=head, 3)

$Northeast
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Connecticut 3100 5348 1.1 72.48 3.1 56.0 139 4862
Maine 1058 3694 0.7 70.39 2.7 54.7 161 30920
Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103 7826

$South
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982

$`North Central`
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Illinois 11197 5107 0.9 70.14 10.3 52.6 127 55748
Indiana 5313 4458 0.7 70.88 7.1 52.9 122 36097
Iowa 2861 4628 0.5 72.56 2.3 59.0 140 55941

$West
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
# For each region, average each of the 8 numeric variables
lapply(state.by.reg, FUN=function(df) {
return(apply(df, MARGIN=2, mean))
})

$Northeast
Population Income Illiteracy Life.Exp Murder HS.Grad
5495.111111 4570.222222 1.000000 71.264444 4.722222 53.966667
Frost Area
132.777778 18141.000000

$South
Population Income Illiteracy Life.Exp Murder HS.Grad
4208.12500 4011.93750 1.73750 69.70625 10.58125 44.34375
Frost Area
64.62500 54605.12500

$`North Central`
Population Income Illiteracy Life.Exp Murder HS.Grad
4803.00000 4611.08333 0.70000 71.76667 5.27500 54.51667
Frost Area
138.83333 62652.00000

$West
Population Income Illiteracy Life.Exp Murder HS.Grad
2.915308e+03 4.702615e+03 1.023077e+00 7.123462e+01 7.215385e+00 6.200000e+01
Frost Area
Split Apply Combine Procedure

- This can be extended to a more general structure, which we may call the split-apply-combine strategy. It is a combination of the following three steps:

1. Split the data object into some convenient chunks.
2. Apply the function of interest to each chunk.
3. Combine the results from each chunk in a convenient structure.

- Often the apply and combine steps can be performed for us by a single call to the appropriate function from the apply() family.
- The split-apply-combine strategy is simple to conceptualize and very effective, in the sense that it typically requires fewer lines of code than the usual for() loop.
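
The three steps can be sketched with the built-in state.x77 and state.region data used earlier. The variable name mean.income and the choice of Income as the quantity to summarize are illustrative:

```r
# Split: one data frame per region
state.by.reg <- split(data.frame(state.x77), f = state.region)

# Apply + combine: sapply() computes one number per chunk
# and returns the results as a named vector
mean.income <- sapply(state.by.reg, FUN = function(df) mean(df$Income))
mean.income
```

Here a single sapply() call performs both the apply and the combine steps, in place of a for() loop over regions with manual bookkeeping of the results.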
Example

- The strikes data set contains information on 18 countries over 35 years (compiled by Bruce Western, in the Sociology Department at Harvard University). The measured variables are:
  - country, year: country and year of data collection
  - strike.volume: days on strike per 1000 workers
  - unemployment: unemployment rate
  - inflation: inflation rate
  - left.parliament: left wing share of the government
  - centralization: centralization of unions
  - density: density of unions
Example (Contd.)

strikes.df = read.csv("C:/Users/hp/Desktop/pendrive/R course/PG new/strikes.csv")


dim(strikes.df)

[1] 625 8

head(strikes.df)

country year strike.volume unemployment inflation left.parliament


1 Australia 1951 296 1.3 19.8 43.0
2 Australia 1952 397 2.2 17.2 43.0
3 Australia 1953 360 2.5 4.3 43.0
4 Australia 1954 3 1.7 0.7 47.0
5 Australia 1955 326 1.4 2.0 38.5
6 Australia 1956 352 1.8 6.3 38.5
centralization density
1 0.3748588 NA
2 0.3751829 NA
3 0.3745076 NA
4 0.3710170 NA
5 0.3752675 NA
6 0.3716072 NA
Example (Contd.)

I Is there a relationship between a country’s ruling party alignment (left
versus right) and the volume of strikes?
I How do we answer this question statistically?

I One way to understand the relationship is to fit linear models separately for
each of the 18 countries and check whether the variable left.parliament has a
significant effect on the response variable strike.volume.

I Computationally this can be executed in R in at least 3 ways:
I Worst way: manually write 18 separate code blocks
I Bad way: explicit for() loop, where we loop over countries
I Best way: split appropriately, then use sapply()
I Let us execute the split-apply-combine strategy through the following steps.
I (Work with just one chunk of data) First, let’s write code to do the
regression on the data from (say) just Italy.
strikes.df.italy = strikes.df[strikes.df$country=="Italy", ] # Data for It
italy.lm = lm(strike.volume ~ left.parliament, data=strikes.df.italy)
summary(italy.lm)

Call:
lm(formula = strike.volume ~ left.parliament, data = strikes.df.italy)

Residuals:
Min 1Q Median 3Q Max
-930.2 -411.6 -137.3 387.2 1901.4

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -738.75 1200.62 -0.615 0.543
left.parliament 40.29 27.76 1.451 0.156

Residual standard error: 583.3 on 33 degrees of freedom


Multiple R-squared: 0.05999,Adjusted R-squared: 0.0315
F-statistic: 2.106 on 1 and 33 DF, p-value: 0.1562
plot(strikes.df.italy$left.parliament, strikes.df.italy$strike.volume,
main="Italy strike volume versus leftwing alignment",
ylab="Strike volume", xlab="Leftwing alignment")
abline(coef(italy.lm), col=2)
[Figure: scatterplot "Italy strike volume versus leftwing alignment"; x-axis: Leftwing alignment, y-axis: Strike volume, with the fitted regression line]
I (Functionalization) The next step is to turn this into a function
my.strike.lm = function(country.df) {
# Return the intercept and slope of strike.volume regressed on left.parliament
coef(lm(strike.volume ~ left.parliament, data=country.df))
}
my.strike.lm(strikes.df.italy)
(Intercept) left.parliament
-738.74531 40.29109
I (Split data into appropriate chunks) Next we shall split our
data into appropriate chunks, each of which can be handled by
our function. For this purpose, the function split() in R is
often helpful: split(df, f=my.factor) splits a data frame df into
several data frames, defined by constant levels of the factor
my.factor. So we want to split strikes.df into 18 smaller data
frames, each of which has the data for just one country.
strikes.by.country = split(strikes.df, f=strikes.df$country)
class(strikes.by.country)
[1] "list"
names(strikes.by.country) # It has one element for each country
[1] "Australia" "Austria" "Belgium" "Canada" "Denmark"
[6] "Finland" "France" "Germany" "Ireland" "Italy"
[11] "Japan" "Netherlands" "New.Zealand" "Norway" "Sweden"
[16] "Switzerland" "UK" "USA"
head(strikes.by.country$Italy) # Same as what we saw before

country year strike.volume unemployment inflation left.parliament


311 Italy 1951 437 8.8 14.3 37.5
312 Italy 1952 337 9.5 1.9 37.5
313 Italy 1953 545 10.0 1.4 40.2
314 Italy 1954 493 8.7 2.4 40.2
315 Italy 1955 511 7.5 2.3 40.2
316 Italy 1956 372 9.3 3.4 40.2
centralization density
311 0.2513799 NA
312 0.2489860 NA
313 0.2482739 NA
314 0.2466577 NA
315 0.2540366 NA
316 0.2457069 NA
I (Apply our function and combine the results) Let us apply our
function to each chunk of data, and combine the results. Here,
the functions lapply() or sapply() are often helpful. So we
want to apply my.strike.lm() to each data frame in
strikes.by.country. Think about what the output will be from
each function call: a vector of length 2 (intercept and slope), so
we can use sapply(), which combines these vectors into a matrix
with one column per country.
strikes.coefs = sapply(strikes.by.country, FUN=my.strike.lm)
strikes.coefs
Australia Austria Belgium Canada Denmark
(Intercept) 414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
left.parliament -0.8638052 -8.210886 8.447463 17.6766 34.34477
Finland France Germany Ireland Italy Japan
(Intercept) 108.2245 202.4261408 95.657134 -94.78661 -738.74531 964.73750
left.parliament 12.8422 -0.4255319 -1.312305 55.46721 40.29109 -24.07595
Netherlands New.Zealand Norway Sweden Switzerland
(Intercept) -32.627678 721.3464 -458.22397 513.16704 -5.1988836
left.parliament 1.694387 -10.0106 10.46523 -8.62072 0.3203399
UK USA
(Intercept) 936.10154 111.440651
left.parliament -13.42792 5.918647
# We don't care about the intercepts, only the slopes (2nd row).
# Some are positive, some are negative! Let's plot them:
plot(1:ncol(strikes.coefs), strikes.coefs[2,], xaxt="n",
xlab="", ylab="Regression coefficient",
main="Countrywise labor activity by leftwing score")
axis(side=1, at=1:ncol(strikes.coefs),
labels=colnames(strikes.coefs), las=2, cex.axis=0.5)
abline(h=0, col="grey")
[Figure: "Countrywise labor activity by leftwing score"; y-axis: Regression coefficient, x-axis: country names, with a grey horizontal line at 0]
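The question posed earlier was whether the predictor has a significant effect, while the function above returns only the coefficients. The following sketch shows one way the slope and its p-value might be extracted per group. It uses the built-in mtcars data because strikes.csv is not bundled with R; the helper name slope.and.pvalue and the model mpg ~ wt are assumptions made purely for illustration.

```r
# Hypothetical helper (not in the original slides): returns the slope and
# its p-value from a simple linear regression of mpg on wt
slope.and.pvalue <- function(df) {
  fit <- summary(lm(mpg ~ wt, data = df))
  # Row "wt" of the coefficient table holds the slope information
  fit$coefficients["wt", c("Estimate", "Pr(>|t|)")]
}

# Split by number of cylinders, then apply and combine with sapply()
cars.by.cyl <- split(mtcars, f = mtcars$cyl)
slope.table <- sapply(cars.by.cyl, FUN = slope.and.pvalue)
slope.table  # 2 x 3 matrix: one column per number of cylinders
```

The same pattern would apply to the strikes data by replacing the formula with strike.volume ~ left.parliament and splitting on country.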
Using plyr

I plyr was among the most downloaded R packages of all time, and for many good
reasons!

I The plyr package is just another tool for doing split-apply-combine (SAC)
procedures. Actually plyr adds very little new functionality to R. What it
does is make the SAC process cleaner, tidier and easier.

I plyr functions have a neat naming convention. All plyr functions are of the
form **ply().
I The first two letters of the function name tell us the input and output data
types, respectively. Replace ** with characters denoting types (a for array,
d for data frame, l for list):
I First character: input type, one of a, d, l
I Second character: output type, one of a, d, l, or _ (drop)
a*ply() - the input is an array

I The signature for all a*ply() functions is:

a*ply(.data, .margins, .fun, ...)

I Here
I .data : an array
I .margins : index (or indices) to split the array by
I .fun : the function to be applied to each piece
I ... : additional arguments to be passed to the function.
I Note that this looks like:

apply(X, MARGIN, FUN, ...)


Example

I Consider a three dimensional array:


new.array = array(1:27, c(3,3,3))
I Also assign names to the rows, columns and the other
dimension:
rownames(new.array) = c("row1", "row2", "row3")
colnames(new.array) = c("column1", "column2", "column3")
dimnames(new.array)[[3]] = c("Group1", "Group2", "Group3")
I Let us have a final look at the array we created just now:
new.array
, , Group1

column1 column2 column3


row1 1 4 7
row2 2 5 8
row3 3 6 9

, , Group2

column1 column2 column3


row1 10 13 16
row2 11 14 17
row3 12 15 18

, , Group3

column1 column2 column3


I Now we shall try different functions of the a*ply family and note the
change in the output.

library(plyr)

Warning: package ’plyr’ was built under R version 4.3.2

aaply(new.array, 1, sum) # the output is an array

row1 row2 row3


117 126 135
adply(new.array, 1, sum) # puts the output in a data frame

X1 V1
1 row1 117
2 row2 126
3 row3 135

alply(new.array, 1, sum) # puts the output in a list

$`1`
[1] 117

$`2`
[1] 126

$`3`
[1] 135

attr(,"split_type")
[1] "array"
attr(,"split_labels")
X1
1 row1
2 row2
3 row3
I Now we change the index which will create a different splitting.

aaply(new.array, 2:3, sum) # Get back a 3 x 3 array

X2
X1 Group1 Group2 Group3
column1 6 33 60
column2 15 42 69
column3 24 51 78

adply(new.array, 2:3, sum) # Get back a data frame

X1 X2 V1
1 column1 Group1 6
2 column2 Group1 15
3 column3 Group1 24
4 column1 Group2 33
5 column2 Group2 42
6 column3 Group2 51
7 column1 Group3 60
8 column2 Group3 69
9 column3 Group3 78
alply(new.array, 2:3, sum) # Get back a list

$`1`
[1] 6

$`2`
[1] 15

$`3`
[1] 24

$`4`
[1] 33

$`5`
[1] 42

$`6`
[1] 51

$`7`
[1] 60

$`8`
[1] 69

$`9`
l*ply() - the input is a list

The signature for all l*ply() functions is:

l*ply(.data, .fun, ...)

Here
I .data : a list
I .fun : the function to be applied to each element
I ... : additional arguments to be passed to the function
Note that this looks like:

lapply(X, FUN, ...)


my.list = list(nums=rnorm(1000), lets=letters, pops=state.x77[,"Population"])
laply(my.list, range) # Get back an array

1 2
[1,] "-3.66418302870311" "2.6689240524252"
[2,] "a" "z"
[3,] "365" "21198"
ldply(my.list, range) # Get back a data frame

.id V1 V2
1 nums -3.66418302870311 2.6689240524252
2 lets a z
3 pops 365 21198

llply(my.list, range) # Get back a list

$nums
[1] -3.664183 2.668924

$lets
[1] "a" "z"

$pops
[1] 365 21198
laply(my.list, summary) # Doesn't work! Outputs have different types/lengths

Error: Results must have one or more dimensions.

ldply(my.list, summary) # Doesn't work! Outputs have different types/lengths

Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor): Results do not have equal lengths

llply(my.list, summary) # Works just fine

$nums
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.66418 -0.70961 -0.01693 -0.02758 0.63021 2.66892

$lets
Length Class Mode
26 character character

$pops
Min. 1st Qu. Median Mean 3rd Qu. Max.
365 1080 2838 4246 4968 21198
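To see why only llply() succeeds here, it may help to inspect the length of each summary() result. The sketch below uses base R only; the seed is an arbitrary assumption added so that the random numbers are reproducible.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
my.list <- list(nums = rnorm(1000), lets = letters,
                pops = state.x77[, "Population"])

# Length of each summary() result: the numeric elements give six values,
# while the character element gives fewer (three in the output above)
summary.lengths <- lengths(lapply(my.list, summary))
summary.lengths

# Base R's sapply() cannot simplify unequal-length results either,
# so it falls back to returning a list
class(sapply(my.list, summary))
```

Results of unequal length cannot be stacked into the rows of an array or a data frame, whereas a list can hold them without any restriction.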
The fourth option for * I

I The fourth option for * is _: the function a_ply() (or l_ply()) has no
explicit return object, but still runs the given function over the given array
(or list), possibly producing side effects.

par(mfrow=c(3,3), mar=c(4,4,1,1))
a_ply(new.array, 2:3, plot, ylim=range(new.array), pch=19, c
The fourth option for * II
[Figure: 3 x 3 grid of plots produced by the a_ply() call; each panel has x-axis "Index" and y-axis "piece"]
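A smaller sketch of the same idea with l_ply(), where the side effect is printing rather than plotting. The list and the printing function are made up for illustration, and the call is guarded so that the code is skipped when plyr is not installed.

```r
if (requireNamespace("plyr", quietly = TRUE)) {
  # l_ply() calls the function once per list element and discards the
  # results; the side effect here is one printed line per element
  res <- plyr::l_ply(list(a = 1:3, b = 4:6),
                     function(v) cat("sum =", sum(v), "\n"))
  # Nothing useful is returned, so res holds no result object
  print(is.null(res))
}
```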


d*ply() : the input is a data frame

I The signature for all d*ply() functions is:


d*ply(.data, .variables, .fun, ...)
I Here
I .data: data frame
I .variables : variable (or variables) to split the data frame by
I .fun : the function to be applied to each piece
I ... : additional arguments to be passed to the function
I Note that this resembles:
tapply(X, INDEX, FUN, ...)
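As a reminder of what tapply() does, here is a minimal base R sketch on the built-in warpbreaks data (chosen only for illustration). The second computation is closer in spirit to d*ply(), in that the function receives a whole data frame chunk rather than a single vector.

```r
# tapply(): split a vector by a factor and apply a function to each piece
breaks.tapply <- tapply(warpbreaks$breaks, INDEX = warpbreaks$tension,
                        FUN = mean)
breaks.tapply

# split() + sapply(): the function sees the full data frame for each level,
# so it could use several columns at once
breaks.split <- sapply(split(warpbreaks, f = warpbreaks$tension),
                       FUN = function(df) mean(df$breaks))
breaks.split
```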
Strikes data set, revisited

#Regression coefficients separately for each country, old way:


strikes.list = split(strikes.df, f=strikes.df$country)
strikes.coefs = sapply(strikes.list, my.strike.lm)
head(strikes.coefs)

Australia Austria Belgium Canada Denmark


(Intercept) 414.7712254 423.077279 -56.926780 -227.8218 -1399.35735
left.parliament -0.8638052 -8.210886 8.447463 17.6766 34.34477
Finland France Germany Ireland Italy Japan
(Intercept) 108.2245 202.4261408 95.657134 -94.78661 -738.74531 964.73750
left.parliament 12.8422 -0.4255319 -1.312305 55.46721 40.29109 -24.07595
Netherlands New.Zealand Norway Sweden Switzerland
(Intercept) -32.627678 721.3464 -458.22397 513.16704 -5.1988836
left.parliament 1.694387 -10.0106 10.46523 -8.62072 0.3203399
UK USA
(Intercept) 936.10154 111.440651
left.parliament -13.42792 5.918647
# Getting regression coefficient separately for each country, new way:
strikes.coefs.a = daply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.a) # Get back an array, note the difference to sapply()

country (Intercept) left.parliament


Australia 414.77123 -0.8638052
Austria 423.07728 -8.2108864
Belgium -56.92678 8.4474627
Canada -227.82177 17.6766029
Denmark -1399.35735 34.3447662
Finland 108.22451 12.8422018
strikes.coefs.d = ddply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.d) # Get back a data frame

country (Intercept) left.parliament


1 Australia 414.77123 -0.8638052
2 Austria 423.07728 -8.2108864
3 Belgium -56.92678 8.4474627
4 Canada -227.82177 17.6766029
5 Denmark -1399.35735 34.3447662
6 Finland 108.22451 12.8422018
strikes.coefs.l = dlply(strikes.df, .(country), my.strike.lm)
head(strikes.coefs.l) # Get back a list

$Australia
(Intercept) left.parliament
414.7712254 -0.8638052

$Austria
(Intercept) left.parliament
423.077279 -8.210886

$Belgium
(Intercept) left.parliament
-56.926780 8.447463

$Canada
(Intercept) left.parliament
-227.8218 17.6766

$Denmark
(Intercept) left.parliament
-1399.35735 34.34477

$Finland
(Intercept) left.parliament
108.2245 12.8422
Splitting on two (or more) variables

I The d*ply() functions make it very easy to split on two (or more)
variables: we just specify them, separated by a “,” in the .variables
argument.

# First create a variable that indicates whether the year is 1975 or earlier,
# and add it to the data frame
strikes.df$yearPre1975 = strikes.df$year <= 1975
# Then use (say) ddply() to compute regression coefficients for each country,
# separately for the years up to 1975 and the years after 1975
strikes.coefs.1975 = ddply(strikes.df, .(country, yearPre1975), my.strike.lm)
dim(strikes.coefs.1975) # Note that there are 18 x 2 = 36 rows

[1] 36 4
head(strikes.coefs.1975)

country yearPre1975 (Intercept) left.parliament


1 Australia FALSE 973.34088 -11.8094991
2 Australia TRUE -169.59900 12.0170866
3 Austria FALSE 19.51823 -0.3470889
4 Austria TRUE 400.83004 -7.7051918
5 Belgium FALSE -4182.06650 148.0049261
6 Belgium TRUE -103.67439 9.5802824
# We can also create factor variables on-the-fly with I(), as we've seen before
strikes.coefs.1975 = ddply(strikes.df, .(country, I(year<=1975)), my.strike.lm)
dim(strikes.coefs.1975) # Again, there are 18 x 2 = 36 rows

[1] 36 4
head(strikes.coefs.1975)

country I(year <= 1975) (Intercept) left.parliament


1 Australia FALSE 973.34088 -11.8094991
2 Australia TRUE -169.59900 12.0170866
3 Austria FALSE 19.51823 -0.3470889
4 Austria TRUE 400.83004 -7.7051918
5 Belgium FALSE -4182.06650 148.0049261
6 Belgium TRUE -103.67439 9.5802824
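For comparison, splitting on two variables can also be sketched in base R by passing a list of factors to split(). The built-in ToothGrowth data and the object names below are assumptions used only for illustration.

```r
# Split on every combination of supplement type and dose
tooth.groups <- split(ToothGrowth,
                      f = list(ToothGrowth$supp, ToothGrowth$dose))
names(tooth.groups)  # one name per combination, e.g. "OJ.0.5"

# Apply a function to each chunk and combine into a named vector
mean.len <- sapply(tooth.groups, FUN = function(df) mean(df$len))
mean.len
```

The result is a flat named vector, whereas ddply() keeps the splitting variables as separate columns, which is usually more convenient for further analysis.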
