A Programmer's Guide To R
Serafin Schoch
June 17, 2024
Introduction

The programming language R can be both surprisingly convenient and irritating. This guide highlights some of R’s unconventional behavior and explains it. Additionally, I’ll cover basic concepts from other programming languages and either show how to do them in Base R or reference a package that covers them.

R is a rather old language: version 1.0.0 was released in 2000, and it builds upon the even older language S. Both were primarily used in academic contexts (see History of R). This means Base R is old, and many beloved concepts from other languages don’t exist in plain R or behave in unexpected ways. But don’t worry, you don’t need to relearn everything or miss out on these more modern functionalities. R comes with an ecosystem of well-crafted packages. In particular, the "tidyverse" package, a collection of several smaller packages, fills the gaps between Base R and modern programming languages. These packages consolidate recent developments in programming language design into neat wrappers and functions. Seriously, if you don’t use R with the appropriate packages, you’re just using legacy code (see Best Practices for R).

This guide won’t cover the R basics. I’ll assume you know how to install packages and use them, or at least know how to figure it out yourself. I’ll try to build a solid fundamental understanding of R that one would not easily find by googling or using other search engines. Certainly, you can get an even better understanding by reading the documentation. In contrast to the documentation, I’ll provide you with an opinionated and concise selection of fundamentals.

Outline

I. Vectors and Piping: In this section, we explore why R, a functional programming language, lacks a map() function like those in Rust, Haskell, or Java, and why the length of "Hello World" is 1. We then demonstrate how the pipe operator in R can simplify function application and enhance code readability.

II. Lists and Dictionaries: Here we explore why accessing elements of a vector and of a list works the same way, although the two look different on the surface. Along the way, we’ll look at accessing values by name and at how to use "dictionaries" in R.

III. Dataframes: We’ll explore the use of dataframes in R and the advantages of using the dplyr package for data manipulation. By comparing base R and dplyr syntax, we demonstrate how dplyr simplifies and enhances data operations.

IV. Functions and Methods: In this section, we explore the versatility of functions as objects, the utility of assertions for error handling within functions, and the implementation of class-specific methods using R’s object-oriented features.

V. R Package and Rust: We’ll have a look at creating R packages with a Rust backend. Rust integration is streamlined with rextendr. We’ll use devtools to create and publish a package.

Some resources: CheatSheet collection, the same on GitHub, and the R Language Definition.
A Programmer’s Guide to R • Serafin Schoch • June 17, 2024
II. Lists and Dictionaries

Some of the most commonly used data structures are dictionaries and lists, but R doesn’t have a dictionary type. Furthermore, lists in R behave peculiarly: why does calling `some_list[1]` return the first element wrapped in a list, while `some_vec[1]` returns just the first element?

Slicing

To jump straight to the answer: the `[]` brackets are used for slicing. The question is slightly misleading, since slicing a vector also returns a slice and not an element. However, all basic types in R are vectors, so a sliced vector containing only one string is as close as we get to a single string. In that sense, "A" and `some_vec[1]` are essentially the same. This does not hold for lists.

We can access the elements of a list with the `[[]]` brackets. We can also use them on vectors, but we’ll receive just the same as we do by slicing.

vec <- c("A","B","C")
vec[c(TRUE, FALSE, TRUE)] # "A", "C"
vec["B" == vec] # returns "B"
vec[-2] # returns "A", "C"
vec[2:3] <- "X" # vec = "A", "X", "X"

vec <- c(A=1, B=2, C=3) # elements of a vector can carry names
vec

## A B C
## 1 2 3

vec["B"]

## B
## 2

Lists additionally provide the `$name` dollar syntax, which does the same as `[["name"]]`.

list <- list(A=1, B=2, C=3)
identical(list[["A"]], list$A)

## [1] TRUE
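The difference between slicing and element access can be checked directly in base R. The following is a minimal sketch: `some_vec` and `some_list` borrow their names from the prose above, but their values, as well as the `ages` example, are made up for illustration. The last part shows one common way (not the only one) to emulate a dictionary with a named list.

```r
# Hypothetical example data; the values are made up for illustration.
some_vec  <- c("A", "B", "C")
some_list <- list("A", 2, TRUE)

# `[` slices: the result has the same type as the container.
vec_slice  <- some_vec[1]    # character vector of length 1
list_slice <- some_list[1]   # list of length 1 that wraps "A"

# `[[` extracts: the result is the element itself.
vec_elem  <- some_vec[[1]]   # same as the slice, since "A" is itself a vector
list_elem <- some_list[[1]]  # the bare element "A"

is.list(list_slice)             # TRUE
is.list(list_elem)              # FALSE
identical(vec_slice, vec_elem)  # TRUE

# A named list can stand in for a dictionary (one possible approach).
ages <- list(Momo = 12, Beppo = 52)
ages[["Gigi"]] <- 27        # insert or update a key
ages$Momo                   # look up a key
"Gigi" %in% names(ages)     # test whether a key exists
ages[["Beppo"]] <- NULL     # remove a key
names(ages)                 # remaining keys
```

Assigning `NULL` removes an entry from a list, so a dictionary built this way cannot store `NULL` as a value without extra care.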
III. Dataframes

Dataframes are probably the most used data type in R, and this is where the `dplyr` package becomes invaluable. Under the hood, dataframes are lists of vectors (or lists) with some additional constraints (e.g., all columns must have the same length). A tibble is a dataframe with some extra verifications and fewer automatic transformations, reducing unexpected mistakes. It is also the standard dataframe used throughout the tidyverse.

In the following, we use made-up data about some Momo characters to demonstrate filtering, selecting, grouping, and summarizing dataframes. For each operation, I’ll first show how to do it in base R and then, instead of explaining, just provide the equivalent syntax using the `dplyr` package.

library(dplyr) # provides tibble(), %>%, filter(), select(), ...

characters <- tibble(
  Name = c("Momo", "Beppo", "Gigi"),
  Age = c(12, 52, 27),
  Skill = c("Listening", "Steady Pace",
            "Storytelling"),
  Height = c(120, 180, 150)
)
characters

## # A tibble: 3 x 4
##   Name    Age Skill        Height
##   <chr> <dbl> <chr>         <dbl>
## 1 Momo     12 Listening       120
## 2 Beppo    52 Steady Pace     180
## 3 Gigi     27 Storytelling    150

# filter and select -------------------
tmp <- characters[ # base R
  characters$Age < 50,
  c("Name", "Skill")]
tmp[order(tmp$Name),]

## # A tibble: 2 x 2
##   Name  Skill
##   <chr> <chr>
## 1 Gigi  Storytelling
## 2 Momo  Listening

characters %>% # dplyr
  filter(Age < 50) %>%
  select(Name, Skill) %>%
  arrange(Name)

## # A tibble: 2 x 2
##   Name  Skill
##   <chr> <chr>
## 1 Gigi  Storytelling
## 2 Momo  Listening

# grouping and summarizing ------------
tapply( # base R
  characters$Height,
  cut(characters$Age, c(0, 50, 100)),
  mean
)

##   (0,50] (50,100]
##      135      180

characters %>% # dplyr
  group_by(age_cat =
    cut(Age, c(0,50,100))) %>%
  summarise(avg_height = mean(Height))

## # A tibble: 2 x 2
##   age_cat  avg_height
##   <fct>         <dbl>
## 1 (0,50]          135
## 2 (50,100]        180

Conclusion

Base R can do most of the things `dplyr` can, but `dplyr` syntax largely explains itself. Moreover, the syntax makes it easier to think through more complex transformations. Imagine having data from different weather stations, and you want the newest measurement of each station. How would you do it in base R? With `dplyr`, you group by station, arrange by descending date, and slice to yield the first element of each group. (Cheat sheets: dplyr; also check tidyr for pivot_longer()/pivot_wider().)
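The weather-station question can be sketched in a few lines. This is only an illustration under stated assumptions: the `weather` tibble, its column names (`station`, `date`, `temp`), and all values are made up, and the pipeline is one possible reading of the group, arrange, slice recipe described above.

```r
library(dplyr)

# Made-up measurements; the table and its column names are assumptions.
weather <- tibble(
  station = c("North", "North", "South", "South", "South"),
  date    = as.Date(c("2024-06-01", "2024-06-03",
                      "2024-06-02", "2024-06-04", "2024-06-03")),
  temp    = c(14.2, 16.8, 21.5, 23.1, 22.0)
)

newest <- weather %>%
  group_by(station) %>%                      # one group per station
  arrange(desc(date), .by_group = TRUE) %>%  # newest first within each station
  slice(1) %>%                               # keep the first row of each group
  ungroup()

newest
```

Recent versions of `dplyr` also offer `slice_max(date, n = 1)` on the grouped data, which should select the same rows here. Unlike `slice(1)`, it keeps all tied rows by default.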
Methods

In addition to ordinary functions, R has generic functions, whose methods provide different implementations depending on the class of the first argument. Every R object has a class (a vector of strings). You can manually modify the class attribute, potentially breaking the behavior of some functions. Classes provide the backbone for various operations, such as correctly handling addition for both real and complex numbers, and even adding visual elements when building a ggplot. Let’s build our own class and generics.

worker <- 4
class(worker)

## [1] "numeric"

class(worker) <- "proletariat"
class(worker)

## [1] "proletariat"

mean(worker)

## [1] 4

# `...` matches the signature of the generic mean(x, ...)
mean.proletariat <- function(x, ...)
  "doesn't work"
mean(worker)

## [1] "doesn't work"

"+.proletariat" <- function(a, b) {
  if (sum(a, b) >= 10) "protest" else
    sum(a, b) * 0.75
}
worker + worker

## [1] 6

worker + worker + worker

## [1] "protest"

future <- function(x)
  UseMethod("future")
future.proletariat <- function(x)
  "irrelevant"
future(worker)

## [1] "irrelevant"

future(4)

## Error in UseMethod("future"):
##   no applicable method for 'future' applied to an object of class "c('double', 'numeric')"

Here, we first create a variable `worker`, which is a numeric vector (4). Then we assign the class `proletariat` to it. We see that `worker` is now aware of its class. However, since we have not implemented any methods for this class, it still behaves like a number. Next, we define how a worker should behave when we calculate the mean by adding the method `mean.proletariat`. Now, whenever `mean` is called on an object with the class `proletariat`, the string "doesn’t work" is returned. Yay, the worker has learned: `mean(worker)` doesn’t work.

We can even change the behavior of the `+` function. Notice that we have to wrap the method name in `""`, since `+` is a special symbol. This way, we can teach the proletariat what to do when several of them meet up: protest.

So far, we have added methods for already existing functions. If we want to create a new generic function, we use the `UseMethod` function. This way, we create the function `future` and define its behavior for the proletariat. Now we can also check what the future of a worker is: irrelevance. Since no default method is implemented for `future` (`future.default <- ...`), the call to `future(4)` throws an error.

Conclusions

We have learned how to throw errors and how environments resolve variables. Additionally, we explored class-specific functionality and adapting basic functions like addition. These are valuable tools for creating more complex code. For further reading, refer to the R Language Definition.
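Returning to the Methods subsection: the error from `future(4)` disappears once a default method exists. The sketch below is self-contained and re-creates the `future` generic from the text. The body of `future.default` and the `foreman` class are hypothetical additions for illustration only. The second half shows that the class attribute can hold several classes, which dispatch tries from left to right.

```r
# Generic function: dispatches on the class of its first argument.
future <- function(x) UseMethod("future")

# Method for the class used in the text.
future.proletariat <- function(x) "irrelevant"

# Fallback for objects without a more specific method
# (hypothetical body, added for illustration).
future.default <- function(x) "unknown"

worker <- structure(4, class = "proletariat")

future(worker)  # dispatches to future.proletariat
future(4)       # no error any more: falls back to future.default

# The class attribute is a vector, so an object can have several classes.
# "foreman" is a hypothetical class that extends "proletariat".
foreman <- structure(5, class = c("foreman", "proletariat"))

# NextMethod() hands over to the method of the next class in the vector.
future.foreman <- function(x) paste("supervised and", NextMethod())

future(foreman)
```

Whether a generic should have a default method is a design choice: a fallback is convenient, while the error makes unsupported classes visible early.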