Lec 09
Owen G. Ward
library(tidyverse)
Reading
Programming in R
• We have already been programming in R, mostly by writing chunks of code that use
tidyverse functions to do data visualization or wrangling.
• Now we discuss strategies for making our code easier to read and less prone to bugs or
errors.
• One useful tool for this purpose is the pipe, |>, which we’ve seen already.
• A principle in programming is “Do Not Repeat Yourself (DRY)”, and writing functions
can help us stay DRY.
Pipes
• Code may involve parallel computations that are assembled at the end.
• For example, suppose you need to read in two tibbles, manipulate each with actions like
filter(), pivot_longer() or mutate(), and then join them together.
• Rather than use the pipe in such situations, we should save the tibbles to intermediate
objects before joining them.
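A minimal sketch of this pattern. The data frames and object names are made up, and base R stand-ins (subset(), transform(), merge()) are used in place of the tidyverse verbs so that the example runs without extra packages:

```r
# Two hypothetical inputs that need separate preparation
heights <- data.frame(id = 1:4, height_cm = c(150, 160, 170, 180))
weights <- data.frame(id = c(2, 3, 4, 5), weight_kg = c(55, 65, 75, 85))

# Branch 1: one pipeline, saved to an intermediate object
heights_clean <- heights |>
  subset(height_cm > 155) |>
  transform(height_m = height_cm / 100)

# Branch 2: a separate pipeline, saved to its own object
weights_clean <- weights |>
  subset(weight_kg < 80)

# Join the two intermediate results at the end
joined <- merge(heights_clean, weights_clean, by = "id")
joined
```

Each branch is a readable pipeline on its own, and the join step only has to name the two prepared objects.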
R functions – overview
R functions
• We will discuss:
• If you find yourself cutting and pasting the same code multiple times (more than twice,
according to the online textbook), then you should consider writing a function.
• See the online textbook for one example. Here is another:
Example Data
• The Boston dataset in the MASS package includes data on house prices (medv) and
characteristics of different neighborhoods in Boston.
• Certain kinds of statistical analyses of the relationship between medv and the other
variables require that the other variables be standardized, by subtracting the mean values
and dividing by the standard deviation (SD).
Boston dataset
library(MASS)
Boston <- as_tibble(Boston)
Boston
# A tibble: 506 x 14
      crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio black
     <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>   <dbl> <dbl>
 1 0.00632  18    2.31     0 0.538  6.58  65.2  4.09     1   296    15.3  397.
 2 0.0273    0    7.07     0 0.469  6.42  78.9  4.97     2   242    17.8  397.
 3 0.0273    0    7.07     0 0.469  7.18  61.1  4.97     2   242    17.8  393.
 4 0.0324    0    2.18     0 0.458  7.00  45.8  6.06     3   222    18.7  395.
 5 0.0690    0    2.18     0 0.458  7.15  54.2  6.06     3   222    18.7  397.
 6 0.0298    0    2.18     0 0.458  6.43  58.7  6.06     3   222    18.7  394.
 7 0.0883   12.5  7.87     0 0.524  6.01  66.6  5.56     5   311    15.2  396.
 8 0.145    12.5  7.87     0 0.524  6.17  96.1  5.95     5   311    15.2  397.
 9 0.211    12.5  7.87     0 0.524  5.63 100    6.08     5   311    15.2  387.
10 0.170    12.5  7.87     0 0.524  6.00  85.9  6.59     5   311    15.2  387.
# i 496 more rows
# i 2 more variables: lstat <dbl>, medv <dbl>
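Standardizing by hand means repeating the same expression once per column. A sketch of what that looks like, using only the first four rows of three columns copied from the printout (the name boston_head is hypothetical):

```r
# First four rows of three Boston columns, as printed (rounded) above
boston_head <- data.frame(
  crim = c(0.00632, 0.0273, 0.0273, 0.0324),
  nox  = c(0.538, 0.469, 0.469, 0.458),
  rm   = c(6.58, 6.42, 7.18, 7.00)
)

# Cut-and-paste standardization: same expression, one line per column
boston_head$crim <- (boston_head$crim - mean(boston_head$crim)) / sd(boston_head$crim)
boston_head$nox  <- (boston_head$nox - mean(boston_head$nox)) / sd(boston_head$nox)
boston_head$rm   <- (boston_head$rm - mean(boston_head$rm)) / sd(boston_head$rm)
boston_head
```

Every line names its column four times, so a single missed edit after pasting standardizes one column with another column's mean or SD.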
• Not only is this tedious, it is error-prone. Plus, we will need to do the same operation
on other datasets.
A standardization function
• The following function standardizes a vector. (We’ll learn more about the components
of a function in the slides to follow.)
• This has reduced the amount of code and chances for cut-and-paste errors.
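The function itself is not reproduced here. A minimal sketch of what a standardize() function of this kind might look like, assuming it simply subtracts the mean and divides by the SD:

```r
# Hypothetical version of standardize(): subtract the mean, divide by the SD
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Small check on a vector with mean 4 and SD 2
z <- standardize(c(2, 4, 6))
z
```

With such a function, each column needs only one short call, for example standardize(Boston$crim).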
Components of a function
square_num <- function(x = 0) {
  x^2
}
square_num(4)
[1] 16
square_num() # no argument supplied, so the default x = 0 is used
[1] 0
Exercise 1
Re-write our standardize() function to have an additional argument na.rm, set to TRUE by
default.
For reference:
Exercise 1: Solution
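One possible solution, sketched under the assumption that standardize() subtracts the mean and divides by the SD; na.rm is passed on to both mean() and sd():

```r
# standardize() with an na.rm argument that defaults to TRUE
standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# Missing values are ignored when computing the mean and SD,
# but stay missing in the result
z <- standardize(c(2, NA, 4, 6))
z
```

Calling standardize(x, na.rm = FALSE) on a vector with missing values returns all NA, because mean() and sd() are then NA.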
Argument defaults
[1] 0
sum_squares(x = 1)
[1] 10
sum_squares(y = 1)
[1] 1
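The definition of sum_squares() is not shown with these calls. One definition consistent with the two named calls, assuming defaults of x = 0 and y = 3 (an inference from the printed results, not necessarily the original code):

```r
# Assumed definition: both arguments have defaults
sum_squares <- function(x = 0, y = 3) {
  x^2 + y^2
}

sum_squares(x = 1) # y falls back to its default, 3
sum_squares(y = 1) # x falls back to its default, 0
```

An argument that is not supplied in the call takes its default value, which is why each call above needs only one argument.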
Argument matching
• When you call a function, the arguments are matched first by name, then by “prefix”
matching and finally by position:
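f <- function(firstarg, secondarg) {
  firstarg^2 + secondarg
}
f(firstarg = 1, secondarg = 2) # exact name matching
[1] 3
f(s = 2, f = 1) # prefix (partial) matching
[1] 3
f(2, f = 1) # f is matched by prefix first; 2 then fills secondarg by position
[1] 3
f(1, 2) # positional matching
[1] 3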
… in functions
Problems
x <- c(1, 2)
sum(x, na.mr = TRUE) # misspelled na.rm is absorbed by ... and summed as TRUE (= 1)
[1] 4
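The same problem appears in functions we write ourselves. A small sketch with a hypothetical wrapper my_mean() that hands everything in ... on to mean():

```r
# Hypothetical wrapper: everything in ... is passed on to mean()
my_mean <- function(x, ...) {
  mean(x, ...)
}

ok   <- my_mean(c(1, NA, 3), na.rm = TRUE) # na.rm reaches mean(): result is 2
typo <- my_mean(c(1, NA, 3), na.mr = TRUE) # misspelled name is silently ignored: result is NA
ok
typo
```

Passing ... along is convenient, but a misspelled argument name raises no error, so the mistake only shows up in the result.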
What happens
f <- function(x = 0) {
  x^2 # the value of the last expression is returned
}
f(x = 2)
[1] 4
f <- function(x = 0) {
  return(x^2) # explicit return() gives the same result
}
f(2)
[1] 4
Control Flow
• Code within a function is not always executed linearly from start to end.
• We may need to execute different code chunks depending on the function inputs.
• We may need to repeat certain calculations, or loop.
• Such constructs are called control flow.
• We’ll touch on some of the basics.
if and if-else
• if tests a condition and executes code if the condition is true. Optionally, we can couple
it with an else to specify code to execute when the condition is false.
if ("cat" == "dog") {
  print("cat is dog")
} else {
  print("cat is not dog")
}
• If the condition does not evaluate to a single TRUE or FALSE, you will get a warning or an
error. You may have already encountered these:
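A sketch of two common cases, a missing value and a condition of length greater than one. The helper try_if() is hypothetical; it uses tryCatch() so that the message is returned as text instead of stopping the script:

```r
# Run an if statement and report either its value or the condition message
try_if <- function(cond) {
  tryCatch(
    if (cond) "yes" else "no",
    error   = function(e) paste("error:", conditionMessage(e)),
    warning = function(w) paste("warning:", conditionMessage(w))
  )
}

try_if(TRUE)           # an ordinary condition
try_if(NA)             # a missing value is an error
try_if(c(TRUE, FALSE)) # length > 1: an error in R >= 4.2, a warning in older versions
```

Wrapping a vector condition in any() or all(), or testing for NA with is.na() first, avoids both problems.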
for loops
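n <- 10
nreps <- 100
x <- vector(mode = "numeric", length = nreps) # allocate the result vector before the loop
for (i in 1:nreps) { # or i in seq_len(nreps), which is also safe when nreps is 0
  x[i] <- mean(rnorm(n))
}
summary(x)
print(i) # i still exists after the loop and holds its last value
[1] 100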
Exercise 2
Write a function standardize_tibble() that loops through the columns of a tibble and
standardizes each with your standardize() function. (Hints: If tt is a tibble, ncol(tt) is
the number of columns, and 1:ncol(tt) is an appropriate index set. If tt is a tibble, tt[[1]]
is the first column.)
Exercise 2: Solution
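One possible solution, sketched with a standardize() that subtracts the mean and divides by the SD (an assumed definition). A plain data frame is used in the check so the example runs without extra packages; tt[[i]] works the same way for a tibble:

```r
# Assumed helper from Exercise 1
standardize <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# Loop over the columns by index and overwrite each with its standardized version
standardize_tibble <- function(tt) {
  for (i in seq_len(ncol(tt))) {
    tt[[i]] <- standardize(tt[[i]])
  }
  tt
}

toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
out <- standardize_tibble(toy)
out
```

This sketch assumes every column is numeric; a column such as a character vector would need to be skipped or handled separately.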
• Index sets are sometimes the indices of a vector, and can also be the elements of the
vector.
ind <- c("cat", "dog", "mouse")
for (i in seq_along(ind)) {
  print(paste("There is a", ind[i], "in my house"))
}
for (i in ind) {
  print(paste("There is a", i, "in my house"))
}
while loops
• Use a while loop when you want to continue until some logical condition is met.
set.seed(1)
# Number of coin tosses until first success (geometric dist'n)
p <- 0.1
counter <- 0
success <- FALSE
while (!success) {
  success <- as.logical(rbinom(n = 1, size = 1, prob = p))
  counter <- counter + 1
}
counter
[1] 4
break
for (i in 1:100) {
  if (i > 3) break
  print(i)
}
[1] 1
[1] 2
[1] 3
Vectorized functions
• The environment within a function is like a map to the memory locations of all its
variables.
• Function arguments behave as if “passed by value”: changing an argument inside the
function changes only the function’s own copy, not the caller’s object. (R makes the copy
only when the argument is modified.)
• Variables created within the function are also stored in its environment.
f <- function(x) {
  y <- x^2
  ee <- environment() # Returns ID of environment w/in f
  print(ls(ee)) # list objects in ee
  ee
}
f(1) # function call
[1] "ee" "x" "y"
<environment: 0x107df1120>
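The pass-by-value behaviour can be checked directly. A small sketch with a hypothetical function add_one() that modifies its argument:

```r
# Modify the argument inside the function
add_one <- function(v) {
  v[1] <- v[1] + 1 # changes the function's own copy only
  v
}

original <- c(1, 2, 3)
changed <- add_one(original)
original # unchanged by the call
changed  # the modified copy returned by the function
```

To keep a change made inside a function, assign the returned value, as done with changed above.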
Enclosing environments
• Our function f was defined in the global environment, .GlobalEnv, which “encloses” the
environment within f.
• If f needs a variable and can’t find it within f’s environment, it will look for it in the
enclosing environment, and then the enclosing environment of .GlobalEnv, and so on.
• The search() function lists the hierarchy of environments that enclose .GlobalEnv.
search()
• To facilitate this search, each environment includes a pointer to its enclosing environment.
Exercise 3
x <- 1
f <- function(y) {
  g <- function(z) {
    (x + z)^2
  }
  g(y)
}
• What is the enclosing environment of f()?
• What is the enclosing environment of g()?
• What search order does R use to find the value of x when it is needed in g()?
Exercise 3 Solutions
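One way to check the answers is to ask R directly. In this sketch f() is modified, purely for illustration, to return the relevant environments along with its value:

```r
x <- 1
f <- function(y) {
  g <- function(z) {
    (x + z)^2
  }
  list(
    value  = g(y),
    g_encl = environment(g), # enclosing environment of g
    f_exec = environment()   # environment created by this call to f
  )
}

out <- f(2)
out$value                         # (1 + 2)^2
identical(out$g_encl, out$f_exec) # g is enclosed by the environment of the call to f
environment(f)                    # where f was defined: .GlobalEnv when run at top level
```

Because g() is defined inside f(), its enclosing environment is the environment of the call to f(), while f() is enclosed by the environment where it was defined. When g() needs x, R looks first in g()'s own environment, then in f()'s, and then in the environment enclosing f(), where x <- 1 is found.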
• The more your code does, the harder it is for others to read.
– Here “others” includes you some time in the future.
– The online text authors say we should write code that future-you can understand,
because past-you doesn’t answer emails.
• See the Functions are for Humans and Computers section (Section 19.3) of the
online text for helpful tips on writing readable code.
• Functions can be used to prevent repetition, but even if used only once they can improve
code readability.
– For example, you are writing a function func() that computes a statistic mystat
that takes 10 lines of code to calculate.
– The rest of your function is only 5 lines.
– Write a function called mystat() and call it from func().
– If you define func() first, it will be easier to document mystat().
• Writing code in a top-down way is like writing an outline for an essay and then filling in
the details.
– The main function is the outline.
– The sub-functions are the details of each topic.
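A compact sketch of the func()/mystat() idea. The statistic used here (a root mean square) is a made-up stand-in for the 10 lines of calculation:

```r
# Sub-function: the details of one step (hypothetical statistic)
mystat <- function(x) {
  sqrt(mean(x^2))
}

# Main function: reads like an outline of the steps
func <- function(x) {
  x <- x[!is.na(x)]   # drop missing values
  round(mystat(x), 2) # compute and tidy the statistic
}

res <- func(c(3, 4, NA))
res
```

Even though mystat() is called only once, the body of func() now states what happens at each step without showing how the statistic is computed.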
Exercise
Exercise 4: Solution
R packages
# install.packages("hapassoc")
library(hapassoc)
search()
Detaching packages
detach("package:hapassoc")
search()
Package namespaces
• Package authors create a list of objects that will be visible to users when the package is
loaded. This list is called the package namespace.
• You can access functions in a package’s namespace without attaching the package with
library() by using the :: operator.
set.seed(321)
n <- 30
x <- (1:n) / n
y <- rnorm(n, mean = x)
ff <- lm(y ~ x)
car::sigmaHat(ff)
[1] 0.926726