Computerstatistik Skriptum
Contents
Introduction 2
Data storage 41
Flow control 45
R functions 50
Basic statistics in R 55
Further R topics 81
1
Introduction
What is R
• R was developed by Ross Ihaka and Robert Gentleman (the “R & R’s” of the University of Auckland).
• Ihaka, R., Gentleman, R. (1996): R: A language for data analysis and graphics, Journal of Computa-
tional and Graphical Statistics, 5, 299-314.
• R is an environment and language for data manipulation, calculation and graphical display.
• R is a GNU program. This means it is an open source program (as e.g. Linux) and is distributed for
free.
• R is used by more than 2 million users worldwide (according to R Consortium).
• R was originally used by the academic community but it is currently also used by companies like
Google, Pfizer, Microsoft, Bank of America . . .
R communities
• R has local communities worldwide for users to share ideas and learn.
• R events are organized all over the world bringing its users together:
– Conferences (e.g. useR!, WhyR?, eRum)
– R meetups: check out meetup.com
2
R and related languages
• R can be seen as an implementation or dialect of the S language, which was developed at the AT & T
Bell Laboratories by Rick Becker, John Chambers and Allan Wilks.
• The commercial version of S is S-Plus.
• Most programs written in S run unaltered in R, however there are differences.
• Code written in C, C++ or FORTRAN can be run by R too. This is especially useful for
computationally-intensive tasks.
How to get R
• R is available for most operating systems, as e.g. for Unix, Windows, Mac and Linux.
• R can be downloaded from the R homepage http://www.r-project.org
• The R homepage contains besides the download links also information about the R Project and the R
Foundation, as well as a documentation section and links to projects related to R.
• R is available as 32-bit and 64-bit
• R comes normally with 14 base packages and 15 recommended packages
CRAN
• The R version used in the course is 4.1.2 (as of Winter semester 2021/2022).
R extension packages
• R can be easily extended with more packages, most of which can be downloaded from CRAN too.
Installation and updating of those packages is also possible from within R itself (18420 packages are
currently available on CRAN).
• Packages for the analysis and comprehension of genomic data can be downloaded from the Bioconductor
pages (http://www.bioconductor.org).
• But R packages are also available from many other sources like R-forge, Github, . . .
Other distributions of R
• As R is open source and published under a GNU license, one can also make one's own version of R and
distribute it.
• For example Microsoft has Microsoft R Open https://mran.microsoft.com/open
• But there are many others too. We use however here the standard R version from CRAN.
3
What R offers
R is therefore not only a plain statistics software package, but it can be used as one. Most of the standard
statistics and a lot of the latest methodology is available for R.
R screenshot
R console
• R by default has no graphical interface and the so called Console has to be used instead.
• The Console or Command Line Window is the window of R in which one writes the commands and in
which the (non-graphic) output will be shown.
• Commands can be entered after the prompt (>).
• Normally one types one command per row (Enter submits the command). If one wants to put several
commands in one row, the commands have to be separated by a ";".
• When a command line starts with a "+" instead of ">", it means that the last submitted command was
not completed and should be finished now.
• All submitted commands of a session can be recalled with the up and down arrows ↑↓.
4
R as a pocket calculator
> 7 + 11
[1] 18
> 57 - 12
[1] 45
> 12 / 3
[1] 4
> 5 * 4
[1] 20
> 2 ˆ 4
[1] 16
> sin(4)
[1] -0.7568025
• Using the R Console can be quite cumbersome, especially for larger projects. An alternative to the
Command Line Window is the usage of editors or IDEs (integrated development environments).
• Editors are stand-alone applications that can be connected to an installed R version and are used for
editing R source code. The commands are typed and submitted via the menu or key combinations. The
user usually has the choice to submit one command at a time or several commands at once.
• IDEs integrate various development tools (editors, compilers, debuggers, etc.) into a single program -
the user does not have to worry about connecting the individual components
• R has only a very basic editor included which can be started from the menu "File" -> "New script".
• Better editors are EMACS together with ESS, Tinn-R or WinEdt together with R-WinEdt.
These editors offer syntax highlighting and sometimes also templates for certain R structures.
• The most popular IDE is currently probably RStudio.
5
RStudio screenshot
• The main window in RStudio contains five parts: one Menu and four Windows (“Panes”)
• From the drop-down menu RStudio and R can be controlled.
• Pane 1 (top left) - Files and Data: Editing R-Code and view of data sets
• Pane 2 (top right) - Workspace and History:
A more sophisticated example than the previous one will demonstrate some features of R which will be
explained in detail later in the course.
6
> options(digits = 4)
> # setting random seed to get a reproducible example
> set.seed(1)
> # creating data
> eps <- rnorm(100, 0, 0.5)
> eps[1:5]
[1] -0.31323 0.09182 -0.41781 0.79764 0.16475
> group <- factor(rep(1:3, c(30, 40, 30)),
+ labels = c("group 1", "group 2", "group 3"))
> x <- runif(100, 20, 30)
> y <- 3 * x + 4 * as.numeric(group) + eps
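The data frame data.ex summarized below is not created in the code shown above; presumably it simply collects the simulated variables, e.g.:
> data.ex <- data.frame(y, x, group)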
> summary(data.ex)
y x group
Min. : 64.7 Min. :20.3 group 1:30
1st Qu.: 74.3 1st Qu.:21.9 group 2:40
Median : 79.5 Median :23.8 group 3:30
Mean : 81.1 Mean :24.4
3rd Qu.: 86.8 3rd Qu.:26.4
Max. :102.0 Max. :29.8
7
(Figure: scatterplot matrix of the variables y, x and group.)
Build a linear model:
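The fitting call itself is not shown in this extract; judging from the Call: lines below it was of the following form (the object name mod is an assumption), with the printed model first and its summary afterwards:
> mod <- lm(y ~ x + group)
> mod
> summary(mod)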
Call:
lm(formula = y ~ x + group)
Coefficients:
(Intercept) x groupgroup 2 groupgroup 3
3.77 3.01 4.07 7.98
Call:
lm(formula = y ~ x + group)
Residuals:
Min 1Q Median 3Q Max
-1.1988 -0.2797 0.0198 0.2792 1.0893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.7682 0.4288 8.79 6e-14 ***
x 3.0110 0.0169 178.19 <2e-16 ***
groupgroup 2 4.0666 0.1094 37.18 <2e-16 ***
groupgroup 3 7.9754 0.1201 66.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9
(Figure: regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance.)
At first sight R looks a bit difficult, but statistical analyses can already be done with a few basic commands.
To learn about those commands several sources are available:
• On the R homepage one can find the official manuals under Documentation -> Manuals. Especially
the “An Introduction to R” Manual is recommended.
• "Unofficial" tutorials and manuals, also in languages other than English, can be found on the R
homepage under Documentation -> Other or on CRAN under Documentation -> Contributed. Very
useful from there is the R reference card by Tom Short.
10
R Tutorials for SAS, Stata or SPSS users
Many new R users are familiar with SAS, Stata and/or SPSS. For them, special overview charts showing how
tasks they are used to doing in SAS, Stata or SPSS can be done in R, as well as extended manuals for an
easier move to R, are available.
The following references might then be helpful:
• http://r4stats.com
• Muenchen, R.A. (2008): R for SAS and SPSS Users
• Muenchen, R.A. and Hilbe, J. (2010): R for Stata Users
Help within R
• There are three types of help available in R. They can be accessed via the menu or the command
line. Here only the command line versions will be explained
• Using an internet browser:
> help.start() will invoke an internet browser with links to manuals, FAQs and the help pages of all
functions sorted by package, together with a search engine.
• The help command:
> help(command) will show the help page for command. A shorter version that does the same is > ?command. For a
few special commands the help works only when the command is quoted, e.g. > help("if")
• The help.search command:
With > help.search("keyword") one can search all titles and aliases of the help files for a keyword. A
shorter version that does the same is > ??keyword. This is however not a full text search.
There are also three other functions useful to learn about functions.
• apropos: apropos("string") searches all functions that have the string in their function name
• demo: The demo function runs some available scripts to demonstrate their usage. To see which topics
have a demo script submit > demo()
• example: > example(topic) runs all example codes from the help files that belong to the topic topic
or use the function topic.
• Also in case you only remember the beginning of a function name or are just lazy - R has an auto-completion
feature. If you start typing a command and hit Tab, R will complete the command if there are no
alternatives or will give you all the alternatives.
• R as one of the main statistical software programs has several mailing lists. There are general mailing
lists or lists of special interest groups like a list for mixed effects models or robust statistics (for details
see the R homepage).
• The general mailing list is R-help where questions are normally answered pretty quickly. But make
sure to read the posting guide before you ask something yourself! The R-help mails are also archived
and can be searched.
• The search link on the R homepage leads to more information on search resources.
• And last but not least, there is also Stack Overflow.
11
R Markdown
• Mixture of Markdown, a markup language for writing documents in plain text, and “chunks” of code
in R or another programming language.
• Then the input is rendered into a document (aka knitted), R runs the code, automatically collects
printed output and graphics and inserts them in the final document.
• In RStudio it can be created using File -> New File -> R Markdown. A window pops up where you
can choose among different types of output. Once this is chosen (e.g., a pdf document) a new file will
open with a template.
• The first part of the template is called YAML (Yet Another Markup Language) and contains informa-
tion that will be used when rendering your document.
• The actual document starts after the YAML preamble.
The five most used data structures in R can be categorized using their dimensionality and whether all content
must be of the same type, i.e. if they are homogeneous or heterogeneous.
Homogeneous Heterogeneous
1D vector list
2D matrix data frame
nD array
Scalars as on the previous slide are treated as vectors of length 1. And almost all other types of objects in
R are built upon these five structures.
To understand the structure of an object in R the best is to use
str(object)
12
Vectors in R
The most basic structure is a vector. They come in two different flavors:
• atomic vector
• list
• In an atomic vector all elements must be of the same type, whereas in the list the different elements
can be of different types.
• There are four common types for an atomic vector:
– logical
– integer
– double (often referred to as numeric)
– character
The most direct way to create a vector is the c function where all values can be entered. The values are
then concatenated.
A single number is also treated like a vector but can be assigned to an object even more easily:
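The creation of the vectors printed below is not included in this extract; definitions consistent with the output (the exact values are an assumption) would be:
a <- 5   # a single number is just a vector of length 1
LogVector <- c(TRUE, FALSE, FALSE, TRUE)
IntVector <- c(1L, 2L, 3L, 4L)
DouVector <- c(1, 2, 3, 4)
ChaVector <- c("a", "b", "c", "d")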
LogVector
[1] TRUE FALSE FALSE TRUE
IntVector
[1] 1 2 3 4
13
DouVector
[1] 1 2 3 4
ChaVector
[1] "a" "b" "c" "d"
• If used within c() NA will always be coerced to the correct type of the vector.
• To create NAs of a specific type one can use NA_real_, NA_integer_ or NA_character_.
> 1 / 0
[1] Inf
> 0 / 0
[1] NaN
• is.character
• is.double
• is.integer
• is.logical
• is.atomic
typeof(IntVector)
[1] "integer"
typeof(DouVector)
[1] "double"
is.atomic(IntVector)
[1] TRUE
is.character(IntVector)
[1] FALSE
is.double(IntVector)
[1] FALSE
is.integer(IntVector)
[1] TRUE
is.logical(IntVector)
[1] FALSE
is.numeric(LogVector)
[1] FALSE
is.numeric(IntVector)
[1] TRUE
is.numeric(DouVector)
[1] TRUE
is.numeric(ChaVector)
[1] FALSE
R has 6 basic data types (the ones shown below + a raw data type used to hold raw bytes).
> z <- 1L
> typeof(z)
[1] "integer"
> k <- 2 + 4i
> typeof(k)
[1] "complex"
Usually, data vectors are not entered by hand in R, but read in as data saved in some other format.
However, often vectors with structure are needed and the following slides give some useful functions to create
such vectors.
Sequences
To create a vector that has a certain start and ending point and is filled with points that have equal steps
between them, the function seq can be used.
15
x <- seq(from = 0, to = 1, by = 0.2)
x
[1] 0.0 0.2 0.4 0.6 0.8 1.0
y <- seq(length = 6, from = 0, to = 1)
y
[1] 0.0 0.2 0.4 0.6 0.8 1.0
z <- 1:5
z
[1] 1 2 3 4 5
Replications
The function rep can be used to replicate objects in several ways. For details see the help of the function.
Here are some examples
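A minimal sketch of typical rep calls (illustrative, not necessarily the original examples):
rep(1:3, times = 2)
[1] 1 2 3 1 2 3
rep(1:3, each = 2)
[1] 1 1 2 2 3 3
rep(1:3, times = c(3, 1, 2))
[1] 1 1 1 2 3 3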
The sample function allows us to obtain a random sample of a specified size from the elements given in
a vector. The following code corresponds to the results of rolling a 6-sided die:
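The call (and random seed) that produced the result below is not shown; a call of the following form, assumed here, gives such output:
sample(1:6, size = 8, replace = TRUE)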
[1] 1 1 3 1 1 6 6 6
Logical operators in R
Logical vectors are usually created by using logical expressions. The logical vector is of the same length as
the original vector and gives elementwise the result for the evaluation of the expression.
The logical operators in R are:
Operator   Meaning
==         equal to (=)
!=         not equal to (≠)
<          less than (<)
>          greater than (>)
>=         greater than or equal to (≥)
<=         less than or equal to (≤)
Two logical expressions L1 and L2 can be combined using:
L1 & L2 for L1 and L2
L1 | L2 for L1 or L2
!L1 for the negation of L1
Logical vectors are typically created in the following way:
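A minimal sketch (illustrative values):
x <- c(1, 5, 3, 8)
x > 4
[1] FALSE  TRUE FALSE  TRUE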
When entering a logical vector, TRUE can be abbreviated with T and FALSE with F; this is however
not recommended.
Vector arithmetic
Here is a short example of vector arithmetic and the recycling of vectors:
x <- 1:4
x
[1] 1 2 3 4
y <- rep(c(1,2), c(2,4))
y
[1] 1 1 2 2 2 2
x ^ 2
[1] 1 4 9 16
x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 2 3 5 6 3 4
Taking substrings using substr (alternatively substring can be used, but it has slightly different arguments):
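The vector cols used below is not defined in this extract; an assumed definition together with a substr example:
cols <- c("red", "yellow", "blue")
substr(cols, 1, 2)
[1] "re" "ye" "bl"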
paste(cols, "flowers")
Coercion
• As all elements in an atomic vector must be of the same type it is of course of interest what happens
if they aren’t.
• In that case the different elements will be coerced to the most flexible type.
• The most flexible type is usually character. But for example a logical vector can be coerced to an
integer or double vector where TRUE becomes 1 and FALSE becomes 0.
• Coercion order: logical -> integer -> double -> (complex) -> character
• Coercion often happens automatically. Most mathematical functions try to coerce vectors to numeric
vectors. And on the other hand, logical operators try to coerce to a logical vector.
• In most cases if coercion does not work, a warning or error message is returned.
• In programming, to avoid coercion to a possibly wrong type, the coercion is forced explicitly using the "as"
functions like as.character, as.double, as.numeric, . . .
LogVector
[1] TRUE FALSE FALSE TRUE
sum(ChaVector)
Error in sum(ChaVector): invalid 'type' (character) of argument
as.numeric(LogVector)
[1] 1 0 0 1
ChaVector2 <- c("0", "1", "7")
as.integer(ChaVector2)
[1] 0 1 7
18
ChaVector3 <- c("0", "1", "7", "b")
as.integer(ChaVector3)
Warning: NAs introduced by coercion
[1] 0 1 7 NA
Lists
Lists are different from atomic vectors as their elements do not have to be of the same type.
To construct a list one usually uses list.
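The list List1 used below is not defined in this extract; a definition consistent with the printed results (length 4, with a NULL second component) would for instance be:
List1 <- list(1:3, NULL, "hello", c(TRUE, FALSE))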
length(List1)
[1] 4
[[2]]
NULL
Combining Lists
• Several lists can be combined into one list using c.
• If a combination of lists and atomic vectors is given to c then the function will first coerce each atomic
vector to lists before combining them.
19
List4 <- list(a = 1, b = 2)
Vec1 <- 3:4
Vec2 <- c(5.0, 6.0)
List5 <- c(List4, Vec1, Vec2)
List6 <- list(List4, Vec1, Vec2)
str(List5)
List of 6
$ a: num 1
$ b: num 2
$ : int 3
$ : int 4
$ : num 5
$ : num 6
str(List6)
List of 3
$ :List of 2
..$ a: num 1
..$ b: num 2
$ : int [1:2] 3 4
$ : num [1:2] 5 6
More on lists
Attributes
• All objects in R can have additional attributes to store metadata about the object. The number of
attributes is basically not limited, and the collection of attributes can be thought of as a named list with
unique component names.
• Individual attributes can be accessed using the function attr or all at once using the function
attributes.
Attributes examples
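The vector VecX used below is not created in this extract; a definition consistent with the printed attributes (the length of VecX is an assumption) would be:
VecX <- 1:10
attr(VecX, "attribute1") <- "I'm a vector"
attr(VecX, "attribute2") <- 3
attributes(VecX)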
$attribute1
[1] "I'm a vector"
$attribute2
[1] 3
typeof(attributes(VecX))
[1] "list"
Special attributes in R
In R three attributes play a special role; we will come back to them later in more detail and just mention
them briefly now:
• names: the names attribute is a character vector giving each element a name. This will be discussed
soon.
• dimension: the dim (dimension) attribute will turn vectors into matrices and arrays.
• class: the class attribute is very important in the context of S3 classes discussed later.
• Depending on the function used, attributes might or might not get lost.
• The three special attributes mentioned earlier have special roles and are usually not lost; many other
attributes, however, often do get lost.
attributes(5 * VecX - 7)
$attribute1
[1] "I'm a vector"
$attribute2
[1] 3
attributes(sum(VecX))
NULL
attributes(mean(VecX))
NULL
1. Directly at creation:
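A minimal sketch of naming at creation (illustrative values):
Nvec1 <- c(a = 1, b = 2, c = 3)
Nvec1
a b c
1 2 3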
Nvec2 <- 1:3
Nvec2
[1] 1 2 3
names(Nvec2) <- c("a", "b", "c")
Nvec2
a b c
1 2 3
Properties of names
names(c(a = 1, 2, 3))
[1] "a" "" ""
names(1:3)
NULL
Factors
Categorical data is an important data type in statistics - in R they are usually represented by factors.
A factor in R is basically an integer vector with two attributes:
1. The class attribute which has the value factor and which makes it behave differently compared to
standard integer values.
2. The levels attribute which specifies the set of admissible values the factor can take.
Factors demo
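The factor Fac1 used below is not created in this extract; a definition consistent with the printed class and levels (the exact values are an assumption) would be:
Fac1 <- factor(c("blue", "green", "blue", "blue"))
class(Fac1)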
[1] "factor"
levels(Fac1)
[1] "blue" "green"
Levels of a factor
Hence all possible values of a factor should be specified, even when not all of them appear in the observed
vector. This will also often be more informative when analyzing data.
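The vectors SexCha and SexFac used below are not created in this extract; definitions consistent with the tables (assumed here) would be:
SexCha <- rep("male", 3)
SexFac <- factor(SexCha, levels = c("male", "female"))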
table(SexCha)
SexCha
male
3
table(SexFac)
SexFac
male female
3 0
• In statistics often one group is used as a reference group and all other groups are compared to this
group.
• To achieve this in R the reference group should be the first level of a factor.
• To change the order of the levels, the function relevel should be used.
23
treat
[1] DRUG2 DRUG2 PLACEBO PLACEBO PLACEBO PLACEBO
Levels: DRUG2 PLACEBO
treat2 <- factor(rep(c(1, 3), c(2, 4)), levels = 1:3,
labels = c("DRUG2", "DRUG1", "PLACEBO"))
treat2
[1] DRUG2 DRUG2 PLACEBO PLACEBO PLACEBO PLACEBO
Levels: DRUG2 DRUG1 PLACEBO
treat3 <- relevel(treat2, ref = "PLACEBO")
treat3
[1] DRUG2 DRUG2 PLACEBO PLACEBO PLACEBO PLACEBO
Levels: PLACEBO DRUG2 DRUG1
Often one observes numeric values for a variable and wants to categorize it according to its value. This
can easily be done using the function cut.
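A minimal sketch of cut (illustrative values and break points):
age <- c(15, 23, 37, 61, 45)
cut(age, breaks = c(0, 18, 40, 65), labels = c("young", "middle", "old"))
[1] young  middle middle old    old
Levels: young middle old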
• Adding a dim attribute to an atomic vector allows it to behave like a multidimensional array.
• A special case of an array is a matrix - there the dimension attribute is of length 2.
• While matrices are an essential part of statistics, arrays are much rarer but are still useful.
• Usually matrices and arrays are not created by modifying atomic vectors but by using the functions
matrix and array.
24
A1 <- array(1:24, dim = c(3, 4, 2))
A1
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
• Naturally, also the “length” attribute of a matrix is then two-dimensional. The corresponding functions
are ncol and nrow.
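The matrix M1 used below is not created in this extract; a definition consistent with the printed values is:
M1 <- matrix(1:6, nrow = 2)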
ncol(M1)
[1] 3
nrow(M1)
[1] 2
colnames(M1) <- LETTERS[1:3]
rownames(M1) <- letters[1:2]
M1
A B C
a 1 3 5
b 2 4 6
rownames(M1)
[1] "a" "b"
length(M1) ## number of elements in matrix!
[1] 6
c(M1) ## columns are appended into an atomic vector
[1] 1 2 3 4 5 6
The counterpart of length for an array is dim and the counterpart of names is dimnames, which is a list of
character vectors of appropriate lengths.
dim(A1)
[1] 3 4 2
dimnames(A1) <- list(c("r1", "r2", "r3"), c("c1", "c2", "c3", "c4"),
c("a1", "a2"))
A1
, , a1
c1 c2 c3 c4
r1 1 4 7 10
r2 2 5 8 11
r3 3 6 9 12
, , a2
c1 c2 c3 c4
r1 13 16 19 22
r2 14 17 20 23
r3 15 18 21 24
• The extension of c for matrices is given by cbind and rbind. Similarly, the package abind provides the function
abind for arrays.
• For transposing a matrix in R the function t is available and for the array counterpart the function
aperm.
• To check if an object is a matrix / array the functions is.matrix / is.array can be used.
• Similarly coercion to matrices and arrays can be performed using as.matrix / as.array.
Data frames
The function data.frame can be used to create data frames. Since R 4.0.0 it does not by default convert
character vectors to factors anymore.
Most functions which read external data into R also return a data frame.
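The data frame DF1 used in the following examples is not created in this extract; a definition consistent with the later output (the column b is purely hypothetical) would be:
DF1 <- data.frame(a = 4:6, b = c(TRUE, FALSE, TRUE), c = c("o", "p", "q"))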
26
stringsAsFactors
Note that the argument stringsAsFactors = TRUE provides the old behaviour of automatic conversion.
options(stringsAsFactors = TRUE)
Basically a data frame is a list with an S3 class attribute. So “checks” of a data frame yield:
typeof(DF1)
[1] "list"
class(DF1)
[1] "data.frame"
is.data.frame(DF1)
[1] TRUE
Lists, vectors and matrices can be coerced to data frames if it is appropriate. For lists this means that all
objects have the same “length”.
V1 <- 1:5
L1 <- list(V1 = V1, V2 = letters[c(1, 2, 3, 2, 1)])
L2 <- list(V1 = V1, V2 = letters[c(1, 2, 3, 2, 1, 3)])
str(as.data.frame(V1))
'data.frame': 5 obs. of 1 variable:
$ V1: int 1 2 3 4 5
str(as.data.frame(M1))
'data.frame': 2 obs. of 3 variables:
$ A: int 1 2
$ B: int 3 4
$ C: int 5 6
str(as.data.frame(L1))
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 2 3 4 5
$ V2: chr "a" "b" "c" "b" ...
str(as.data.frame(L2))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply dif
• The basic functions to combine two data frames (works similar with matrices) are cbind and rbind.
• When combining column-wise, then the numbers of rows must match and row names are ignored (hence
observations need to be in the same order).
• When combining row-wise the number of columns and their names must match.
• For more advanced combining see the function merge.
Note that cbind (and rbind) try to make matrices when possible. Only if at least one of the elements to be
combined is a data frame will the result also be a data frame.
Hence vectors usually can't be combined into a data frame using cbind.
V1 <- 1:3
V2 <- c("a", "b", "a")
str(cbind(V1, V2))
chr [1:3, 1:2] "1" "2" "3" "a" "b" "a"
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "V1" "V2"
28
Special columns in a data frame
• More common than adding a list is to add a matrix to a data frame - also here the protector
function I should be used.
# works:
DF4$b <- list(1:2,1:3,1:4)
DF4
a b
1 1 1, 2
2 2 1, 2, 3
3 3 1, 2, 3, 4
# does not work
DF5 <- data.frame(a = 1:3, b = list(1:2,1:3,1:4))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply dif
# does work
DF6 <- data.frame(a = 1:3, b = I(list(1:2,1:3,1:4)))
DF6
a b
1 1 1, 2
2 2 1, 2, 3
3 3 1, 2, 3, 4
Subsetting is a key feature when working with data. R is really flexible in this regard and has many different
ways to subset the different data structures.
In the following we will discuss the main ways for the main data structures.
29
Subsetting atomic vectors
We will start with subsetting atomic vectors, as subsetting the other structures is quite similar.
There are six ways to subset an atomic vector:
Specifying in square brackets the position of the elements which should be selected.
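The vector V1 used in these examples is not defined in this extract; a definition consistent with the printed values, together with an example of positive indexing:
V1 <- c(1, 3, 2.5, 7.2, -3.2)
V1[c(1, 3)]
[1] 1.0 2.5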
Specifying in square brackets the positions of the elements which should not be selected.
V1[-c(2, 4, 5)]
[1] 1.0 2.5
V1[c(-1, 2)]
Error in V1[c(-1, 2)]: only 0's may be mixed with negative subscripts
Giving in square brackets a logical vector of the same length means that the elements with value TRUE will
be selected.
# basic version
V1[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
[1] 1.0 2.5
# if the logical vector is too short,
# it will be recycled.
V1[c(TRUE, FALSE, TRUE)]
[1] 1.0 2.5 7.2
# most common is to use expression
# which return a logical vector
V1[V1 < 3]
[1] 1.0 2.5 -3.2
Giving in square brackets a character vector of the names which should be selected.
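A minimal sketch (the names are added here so that the later output with names a to e matches):
names(V1) <- c("a", "b", "c", "d", "e")
V1[c("a", "c")]
  a   c
1.0 2.5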
Blank indexing is not useful for atomic vectors but will be relevant for higher dimensional objects. It
returns in this case the original atomic vector.
V1[]
a b c d e
1.0 3.0 2.5 7.2 -3.2
Zero indexing returns in this case a zero length vector. It is often used when generating testing data.
V1[0]
named numeric(0)
Indexing lists
Lists are in general subset quite like atomic vectors. There are however more operators available for subset-
ting:
1. [ ([ ])
2. [[ ([[ ]])
3. $
The first one always returns a list, the other two options extract list components (details will follow later).
31
L1 <- list(a = 1:2, b = letters[1:3], c = c(TRUE, FALSE))
L1[1]
$a
[1] 1 2
L1[[1]]
[1] 1 2
L1$a
[1] 1 2
The most common way is to generalize the atomic vector subsetting to higher dimension by using one of the
six methods described earlier for each dimension.
Here then especially the blank indexing becomes relevant.
We will focus here on matrices, but arrays work basically the same.
M1[-2, ]
a b c
1 3 5
As matrices (arrays) are essentially vectors with a dimension attribute, a single vector can also be used to
extract elements. For this it is important that matrices (arrays) are filled in column-major order.
32
M2 <- outer(1:5, 1:5, paste, sep = ",")
M2
[,1] [,2] [,3] [,4] [,5]
[1,] "1,1" "1,2" "1,3" "1,4" "1,5"
[2,] "2,1" "2,2" "2,3" "2,4" "2,5"
[3,] "3,1" "3,2" "3,3" "3,4" "3,5"
[4,] "4,1" "4,2" "4,3" "4,4" "4,5"
[5,] "5,1" "5,2" "5,3" "5,4" "5,5"
M2[c(3, 17)]
[1] "3,1" "2,4"
This is rarely done but possible. To select elements from an n-dimensional object, a matrix with n columns
can be used. Each row of the matrix specifies one element. The result will always be a vector. The matrix
can consist of integers or of characters (if the array is named).
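A short sketch using the matrix M2 defined above (the selected elements are illustrative):
sel <- rbind(c(1, 2), c(3, 4))
M2[sel]
[1] "1,2" "3,4"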
Recall that data frames are on the one side lists and on the other side similar to matrices.
If a data frame is subset with a single vector it behaves like a list. If subset with two vectors it behaves like
a matrix.
To select columns:
# like a matrix
DF1[, c("a","c")]
a c
1 4 o
2 5 p
3 6 q
# like a list
DF1[c("a","c")]
a c
1 4 o
2 5 p
3 6 q
# like a matrix
DF1[, "a"]
[1] 4 5 6
# like a list
DF1[ "a"]
a
1 4
2 5
3 6
• In general S3 objects consist of atomic vectors, matrices, arrays, lists and so on. And they can be
extracted from the S3 object using the same ways as described above.
• Again, the initial step is to look at str to reveal the details of the object.
set.seed(1)
x <- runif(1:100)
y <- 3 + 0.5 * x + rnorm(100, sd = 0.1)
fit1 <- lm(y ~ x)
class(fit1)
[1] "lm"
str(fit1)
List of 12
$ coefficients : Named num [1:2] 2.982 0.531
..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
$ residuals : Named num [1:100] 0.0495 -0.0549 0.0342 -0.1234 0.1549 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ effects : Named num [1:100] -32.572 1.414 0.028 -0.137 0.157 ...
..- attr(*, "names")= chr [1:100] "(Intercept)" "x" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:100] 3.12 3.18 3.29 3.46 3.09 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:100, 1:2] -10 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:100] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "x"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.1 1.05
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 98
$ xlevels : Named list()
$ call : language lm(formula = y ~ x)
$ terms :Classes 'terms', 'formula' language y ~ x
.. ..- attr(*, "variables")= language list(y, x)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "y" "x"
.. .. .. ..$ : chr "x"
.. ..- attr(*, "term.labels")= chr "x"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(y, x)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "y" "x"
$ model :'data.frame': 100 obs. of 2 variables:
..$ y: num [1:100] 3.17 3.12 3.32 3.34 3.24 ...
..$ x: num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ x
.. .. ..- attr(*, "variables")= language list(y, x)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "y" "x"
.. .. .. .. ..$ : chr "x"
.. .. ..- attr(*, "term.labels")= chr "x"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(y, x)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
- attr(*, "class")= chr "lm"
# the intercept
fit1$coefficients[1]
(Intercept)
2.982
# the slope
fit1$coefficients[2]
x
0.5312
# sigma needs to be computed
sqrt(sum((fit1$residuals - mean(fit1$residuals))^2) / fit1$df.residual)
[1] 0.09411
summary(fit1)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.18498 -0.05622 -0.00871 0.05243 0.25166
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9821 0.0206 145 <2e-16 ***
x 0.5312 0.0353 15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These operators are much more restrictive than their standard counterparts.
36
More on standard subsetting operators
We have already used the operators [[ and $, which are frequently used when extracting parts from lists and other
objects.
• [[ is similar to [, but it can only extract a single value/component. Hence only a positive integer or a
string can be used in combination with [[.
• $ is a shorthand for [[ when the component is named.
These operators are mainly used in the context of lists and the difference is that [ always returns a list while
[[ gives the content of the list.
str(L1)
List of 3
$ a: int [1:2] 1 2
$ b: chr [1:3] "a" "b" "c"
$ c: logi [1:2] TRUE FALSE
L1[[1]]
[1] 1 2
L1[1]
$a
[1] 1 2
L1$a
[1] 1 2
str(L1[[1]])
int [1:2] 1 2
str(L1[1])
List of 1
$ a: int [1:2] 1 2
str(L1$a)
int [1:2] 1 2
If [[ is used with a vector of integers or characters then it is assuming nested list structures.
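A minimal sketch (illustrative list):
NL <- list(a = list(b = 1:3, c = "x"), d = 4)
NL[[c("a", "b")]]   # the same as NL[["a"]][["b"]]
[1] 1 2 3
NL[[c(1, 2)]]       # second element of the first component
[1] "x"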
37
Simplification vs preservation
As the different subsetting operators have different properties, whether they simplify or preserve needs to be
kept in mind at all times, as it can have a huge impact in programming.
When in doubt it is usually better not to simplify, so that an object always stays of the type it was
originally.
To prevent or force simplification, the argument drop can be specified in [.
             Simplification          Preservation
vector       x[[1]]                  x[1]
list         x[[1]]                  x[1]
factor       x[ind, drop = TRUE]     x[ind]
matrix       x[1, ] or x[, 1]        x[ind, , drop = FALSE] or x[, ind, drop = FALSE]
data frame   x[, 1] or x[[1]]        x[, 1, drop = FALSE] or x[1]

Here ind is an indexing vector of positive integers and naturally arrays behave the "same" as matrices.
Simplification for lists concerns if the result has to be a list or can be of the type of the extracted object.
38
F1 <- factor(c("a", "b", "a"),
levels = c("a","b","c"))
F1
[1] a b a
Levels: a b c
F1[1]
[1] a
Levels: a b c
F1[1, drop = TRUE]
[1] a
Levels: a
droplevels(F1)
[1] a b a
Levels: a b
Simplification for data frames means single columns are returned as vectors and not as data frames.
39
DF1 <- data.frame(a = 1:2, b = letters[1:2])
str(DF1[1])
'data.frame': 2 obs. of 1 variable:
$ a: int 1 2
str(DF1[[1]])
int [1:2] 1 2
str(DF1[ , "a", drop=FALSE])
'data.frame': 2 obs. of 1 variable:
$ a: int 1 2
str(DF1[ , "a"])
int [1:2] 1 2
More on $
More on $ II
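The content of these two slides is not reproduced in this extract. A typical point about $ (an assumption about what was covered) is that it does partial matching on names, while [[ matches exactly by default:
PL <- list(alpha = 1:3, beta = "b")
PL$al                     # partial matching: finds alpha
[1] 1 2 3
PL[["al"]]                # exact matching by default
NULL
PL[["al", exact = FALSE]]
[1] 1 2 3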
All subsetting operators can be combined with assigning values to the selected parts.
x <- 1:6
x
[1] 1 2 3 4 5 6
x[1] <- 20
x
[1] 20 2 3 4 5 6
x[-1] <- 51:55
x
[1] 20 51 52 53 54 55
x[c(1,1)] <- c(-10,-20)
x
[1] -20 51 52 53 54 55
## Logical & NA indexing can be combined!
## It is be recycled:
x[c(TRUE,FALSE,NA)] <- 1
x
[1] 1 51 52 1 54 55
Data storage
• While it is possible for a computer to store numbers exactly, it is more common to use approximate
representations.
• R uses double precision floating point numbers for its numeric computations.
• E.g., 123.45 is a decimal floating point number everyone understands to be the same as: 123.45 =
1 · 10^2 + 2 · 10^1 + 3 · 10^0 + 4 · 10^(-1) + 5 · 10^(-2).
• One can also write this as 123.45 = 12345 · 10^(-2) = 1.2345 · 10^2 (the last is the normalized form).
• The sequence of (here, decimal) digits 12345 is called the significand (or mantissa), the 2 is the exponent
(or characteristic) of the number.
• A floating point number system is characterized by four integers: b (base or radix), p (precision), and
emin and emax (minimal and maximal exponents).
• Clearly, all floating point numbers can be represented by the triple (sign, exponent, significand).
• IEEE 754 is a standard for base 2 which says: for double precision, use 64 bits (8 bytes) overall, split
as sign: 1 bit, exponent: 11 bits, significand: 52 bits.
• In principle, the exponent is represented using the biased scheme.
– Note: in this scheme, for k = 11 bits, the representation β_{k-1} β_{k-2} . . . β_0 corresponds to
e = (Σ_{i=0}^{k-1} β_i 2^i) − (2^10 − 1).
– So the exponent range would be −1023, −1022, . . . , 1023, 1024.
• But, the smallest (all zero’s) and largest (all one’s) exponents are special!
41
• Representing binary floating point numbers in IEEE 754 works as follows:
(a) Exponent neither all 0 bits nor all 1 bits: this is the normalized number
±(1 + δ_1/2 + δ_2/2^2 + . . . + δ_52/2^52) · 2^e
(b) Exponent all 0 bits: this is the de-normalized number
±(0 + δ_1/2 + δ_2/2^2 + . . . + δ_52/2^52) · 2^(-1022)
(c) Exponent all 1 bits: if all bits in the significand are 0, this is ±∞; otherwise, it is a NaN.
• Question: Which IEEE 754 floating point number does the bit pattern with sign σ, exponent 11 . . . 1 and
significand 0 . . . 0 correspond to?
• Answer: all exponent bits are 1 and all significand bits are zero, so (c) on the previous slide applies: the
number is ±∞.
• Question: Which IEEE 754 floating point number does the bit pattern with sign σ, exponent 00 . . . 0 and
significand 0 . . . 0 correspond to?
• Answer: all exponent bits are 0, so this is a denormalized number (see (b) on the previous slide) which has
all δ_i = 0 and
±(0 + 0/2 + . . . + 0/2^52) · 2^(-1022) = 0
• Note: This is how we get two zeros (because of the sign bit).
• Question: what is the smallest positive normalized number we can represent?
• Answer:
– the exponent should be as small as possible: 000. . . 001 (all zero's does not work, as the number
is normalized).
– the significand should be as small as possible: 000. . . 000.
±(1 + 0/2 + . . . + 0/2^52) · 2^(-1022) = 2^(-1022)
• The largest denormalized number, for comparison, is
(0 + 1/2 + . . . + 1/2^52) · 2^(-1022) = (Σ_{i=1}^{52} 2^(-i)) · 2^(-1022) = 2^(-1022) (1 − 2^(-52)).
42
Rounding effects
• The maximal precision we can expect for floating point computations in R is about 16 significant decimal
digits (52 binary digits).
• So the basic rule 1 + x = 1 ⇒ x = 0 does not hold in floating point arithmetic! (1 + 2^(-52) is the smallest
representable number greater than 1.)
x <- 2^(-52)
1 + x == 1
[1] FALSE
x <- 2^(-53)
1 + x == 1
[1] TRUE
• Consider 5/4 and 4/5. In decimal notation these can be exactly represented as 1.25 and 0.8.
• In binary notation 1.25 = 1.01 can be represented exactly, whereas 0.8 = 0.110011001100. . . is periodic and can therefore only be stored approximately:
n <- 1:10
1.25 * (n * 0.8) == n
[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
• To avoid issues:
all.equal(1.25 * (n * 0.8), n)
[1] TRUE
• Rounding errors tend to accumulate so a long series of calculations will result in larger errors than a
shorter one.
[1] 6
var(x) # built in
[1] 11
sum((x - mean(x))^2)/10
[1] 11
[1] 11
[1] 11
sum((x - mean(x))^2)/10
[1] 11
[1] -13107
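The input code of the chunk above is not shown. A self-contained sketch of the same phenomenon (illustrative data, not the original) is the naive one-pass variance formula, which can suffer from massive cancellation:
x <- 1:11 + 10^9                   # values with a large common offset
var(x)                             # stable two-pass computation
[1] 11
sum((x - mean(x))^2) / 10          # the same, done by hand
[1] 11
(sum(x^2) - 11 * mean(x)^2) / 10   # naive one-pass formula: the result can be
                                   # completely wrong, even negative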
Integer storage
• So the 2^32 = 4294967296 bit sequences have one zero, one NA, and (2^32 − 2)/2 = 2^31 − 1 = 2147483647
positive and negative integers each.
• The smallest such integer is −(2^31 − 1), the largest is 2^31 − 1.
as.integer(2 ^ 31 - 1) # works
[1] 2147483647
as.integer(2 ^ 31) # too large: gives NA (with a coercion warning)
[1] NA
Flow control
Flow control
• Many problems are often of a repetitive nature and solutions are not obtained in a single instance but
certain steps need to be repeated.
• For example
• For this flow control R offers different constructs which we will introduce in the following slides.
for loop
• The for() statement in R specifies that certain statements are to be repeated a fixed number of times.
• The syntax looks like:
for (index in vector) {
statements
}
• This means that the variable index runs through all elements in vector. For each value then in vector
the statements are executed.
• If for each value a result is created which should be stored, then it is recommended to create first an
object of the appropriate length which is used to store the results.
Fibonacci numbers
To compute in R the first 10 Fibonacci numbers we can use a for loop in the following way:
Fib <- numeric(10) ## create a vector which will store numeric elements
Fib[1] <- 1
Fib[2] <- 1
for (i in 3:10) {
Fib[i] <- Fib[i-1] + Fib[i-2]
}
Fib
[1] 1 1 2 3 5 8 13 21 34 55
if statement
• The if statement can be used to control whether and when certain statements are to be executed.
• There are two versions:
if (condition) {
statements when condition is TRUE
}
or
if (condition){
statements when TRUE
} else {
statements when FALSE
}
if else example
x <- 3
if (x < 5) {
print("'x' is smaller than 5")
} else {
print("'x' is at least 5")
}
while loop
• The while loop can be used when statements have to be repeated but it is not known in advance exactly how
often. The computations should be continued as long as a condition is fulfilled.
• The syntax looks like:
while (condition) {
statements
}
• Hence here condition is evaluated and if FALSE nothing will be done. If the condition is however TRUE,
then the statements are executed. After the statements are executed, the condition is again evaluated.
46
Fibonacci numbers II
To compute for example all Fibonacci numbers smaller than 100 we could use
Fib1 <- 1
Fib2 <- 1
Fibs <- c(Fib1)
while (Fib2 < 100) {
Fibs <- c(Fibs, Fib2)
oldFib2 <- Fib2
Fib2 <- Fib1 + Fib2
Fib1 <- oldFib2
}
Fibs
[1] 1 1 2 3 5 8 13 21 34 55 89
Note: increasing the length of a vector can be costly for R! Avoid if possible.
repeat loop
• If a loop is needed which does not go through a prespecified number of iterations or should not have
a condition check at the top the repeat loop can be used.
• The syntax looks like:
repeat {
statements
}
• This causes the statement to be repeated endlessly. Therefore a terminator called break needs to be
included. It is usually included as:
if (condition) break
• In general the break command can be used in any loop and it causes the loop to terminate immediately.
• Similarly, the command next can also be used in any loop; it causes the computations of the
current iteration to be terminated immediately and the next iteration to be started from the top.
• The repeat loop and the functions break and next are rarely used since it is much easier to read and
understand programs using the other looping methods.
To compute for example all Fibonacci numbers smaller than 100 we could use also
Fib1r <- 1
Fib2r <- 1
Fibsr <- c(Fib1r)
repeat {
Fibsr <- c(Fibsr, Fib2r)
oldFib2r <- Fib2r
Fib2r <- Fib1r + Fib2r
Fib1r <- oldFib2r
if (Fib2r > 100) break
}
Fibsr
[1] 1 1 2 3 5 8 13 21 34 55 89
switch
• Another possibility for conditional execution is the function switch. It is especially useful when
there are more than two possibilities or if the options are named.
• The basic syntax is
switch(EXPR, options)
where EXPR can be an integer value which says which option should be chosen, alternatively it can be a
character string if the options are named.
switch examples I
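The calls producing the three results below are not included in this extract; calls of the following form (assumed here) yield exactly this pattern:
z <- 11
switch(1, z, z + 1)        # an integer selects the first option
switch(2, z, z + 1)        # ... or the second option
print(switch(3, z, z + 1)) # no third option: returns NULL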
[1] 11
[1] 12
NULL
switch examples II
SUM <- function(x, type = "L2") {
switch(type,
L2 = {LOC <- mean(x)
SCA <- sd(x)},
L1 = {LOC <- median(x)
SCA <- mad(x)}
)
return(data.frame(LOC = LOC, SCA = SCA))
}
set.seed(1); x <- rnorm(100)
SUM(x)
LOC SCA
1 0.1089 0.8982
SUM(x, type = "L1")
LOC SCA
1 0.1139 0.87
• A function not directly connected to the previous flow control but still useful is ifelse.
• The basic syntax is
ifelse(EXPR, yes, no)
• This function is usually used when EXPR is a vector. The result is a vector of same length as EXPR that
has as corresponding entry the value of yes if EXPR is TRUE, of no if EXPR is FALSE. Missing values in
EXPR remain missing values.
• Note that ifelse will try to coerce EXPR to logical if it is not. Also the attributes from EXPR will be
kept and only the entries replaced.
ifelse example
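The original example is not reproduced here; a minimal sketch (illustrative values):
x <- c(-2, 0, 3, NA, 5)
ifelse(x > 0, "positive", "not positive")
[1] "not positive" "not positive" "positive"     NA             "positive"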
49
R functions
Functions in R
• Functions are fundamental building blocks in R and are self contained units of code with a well-
defined purpose.
• To create a function function() is used. The parentheses enclose the arguments list. Then a single
statement or multiple statements enclosed by {} are specified.
• When R executes a function definition it produces an object with three parts: the formals (argument list), the body, and the environment.
When printing the function it will display these parts. (If the environment is not shown, it is the global
environment.)
To reduce the burden for the user, one can give default values to some arguments:
f <- function(x, y = 1) {
z <- x + y
2 * z
}
f
function(x, y = 1) {
z <- x + y
2 * z
}
formals(f)
$x
$y
[1] 1
50
body(f)
{
z <- x + y
2 * z
}
environment(f)
<environment: R_GlobalEnv>
Primitive functions
• There is one exception: a group of functions which do not have the three parts just described - these
are called primitive functions.
• All primitive functions are located in the base package. They call directly C code and do not contain
any R code.
sum
function (..., na.rm = FALSE) .Primitive("sum")
formals(sum)
NULL
body(sum)
NULL
environment(sum)
NULL
• To demonstrate how some operators are actually functions check the following code:
x <- 10
y <- 20
x + y
[1] 30
'+'(x, y)
[1] 30
Scope of variables
• In R scope is controlled by the environment of the functions.
f <- function(x, y = 1) {
z <- x + y
2 * z
}
z
[1] 2 2 2 1 1 1 1 1
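A self-contained sketch of the scoping rule (not the original example, whose full code is not shown): an assignment inside a function creates a local variable and does not change a global variable of the same name.
zz <- 1          # global variable
g <- function(x) {
  zz <- x + 100  # local zz, the global one is untouched
  zz
}
g(1)
[1] 101
zz
[1] 1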
Lazy evaluation
• In the standard case, R arguments are lazy - they are only evaluated when they are actually used.
• To force an evaluation you have to use the function force.
• This also allows us to specify default values in the header of the function for variables which are created
locally.
f1 <- function(x) 10
f2 <- function(x) {
force(x)
10
}
f1(stop("You made an error!"))
[1] 10
f2(stop("You made an error!"))
Error in force(x): You made an error!
Calling functions
52
Calling functions examples
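The definition of f used below is not included in this extract; a definition consistent with the printed structure would be:
f <- function(pos1, pos2, pos3) {
  list(pos1 = pos1, pos2 = pos2, pos3 = pos3)
}
The call f(1, 2, 3) matches the arguments by position; alternatively arguments can be matched by (partial) name, e.g. f(pos3 = 3, 1, 2).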
str(f(1, 2, 3))
List of 3
$ pos1: num 1
$ pos2: num 2
$ pos3: num 3
Functions returns
• Functions in general can return only one object. This is however not a real restriction, as
all the desired output can be collected into a list.
• The last expression evaluated in a function is by default the returned object.
• Whenever the function return(object) is called within a function, the function is terminated and
object is returned.
f1 <- function(x) {
if (x < 0) return("not positive")
if (x < 5) {
"between 0 and 5"
} else {
"larger than 5"
}
}
f1(-1)
[1] "not positive"
f1(1)
[1] "between 0 and 5"
f1(10)
[1] "larger than 5"
Invisible return
It is possible to return objects from a function call which are not printed by default using the invisible
function.
Invisible output can be assigned to an object and/or forced to be printed by putting the function call between
round parentheses.
f1 <- function() 1
f2 <- function() invisible(1)
f1()
[1] 1
f2()
• The magrittr package defines the pipe operator %>% and many other packages also make use of it.
• Rather than typing f(x, y) we type x %>% f(y) (start with x then use f(y) to modify it).
• R 4.1.x contains a base R pipe |> with the same syntax:
54
x <- 1:4; y <- 4
sum(x, y)
[1] 14
x |> sum(y)
[1] 14
x |> mean()
[1] 2.5
Basic statistics in R
• The following slides give some first vocabulary how to do basic statistics in R and how to formulate
statistical models in R.
• The usage of those functions will be demonstrated using the crabs dataset from the package MASS.
The crabs dataset of the package MASS contains 8 variables measured on 200 crabs. The variables are:
> library(MASS)
> data(crabs)
> # ?crabs would show the help file for the dataset
> str(crabs)
'data.frame': 200 obs. of 8 variables:
$ sp : Factor w/ 2 levels "B","O": 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
$ index: int 1 2 3 4 5 6 7 8 9 10 ...
$ FL : num 8.1 8.8 9.2 9.6 9.8 10.8 11.1 11.6 11.8 11.8 ...
$ RW : num 6.7 7.7 7.8 7.9 8 9 9.9 9.1 9.6 10.5 ...
$ CL : num 16.1 18.1 19 20.1 20.3 23 23.8 24.5 24.2 25.2 ...
$ CW : num 19 20.8 22.4 23.1 23 26.5 27.1 28.4 27.8 29.3 ...
$ BD : num 7 7.4 7.7 8.2 8.2 9.8 9.8 10.4 9.7 10.3 ...
• The classical summary statistics for numeric data are the mean, the standard deviation or variance,
correlation and covariance matrix. Other measures are the median and quantiles as well as the extreme
values.
• A good overview is provided in R using summary.
• The mean and median can be also obtained using functions of the same name.
• The functions for variance and standard deviation have also the obvious function names var and sd.
• Quantiles can be calculated using the quantile function.
> summary(crabs)
sp sex index FL RW CL
B:100 F:100 Min. : 1.0 Min. : 7.2 Min. : 6.5 Min. :14.7
O:100 M:100 1st Qu.:13.0 1st Qu.:12.9 1st Qu.:11.0 1st Qu.:27.3
Median :25.5 Median :15.6 Median :12.8 Median :32.1
Mean :25.5 Mean :15.6 Mean :12.7 Mean :32.1
3rd Qu.:38.0 3rd Qu.:18.1 3rd Qu.:14.3 3rd Qu.:37.2
Max. :50.0 Max. :23.1 Max. :20.2 Max. :47.6
CW BD
Min. :17.1 Min. : 6.1
1st Qu.:31.5 1st Qu.:11.4
Median :36.8 Median :13.9
Mean :36.4 Mean :14.0
3rd Qu.:42.0 3rd Qu.:16.6
Max. :54.6 Max. :21.6
56
> quantile(crabs$RW, probs = seq(0, 1, by = 0.2))
0% 20% 40% 60% 80% 100%
6.50 10.68 11.96 13.50 14.82 20.20
> table(crabs$sex)
F M
100 100
> tab <- table(crabs$sex, crabs$sp)
> tab
B O
F 50 50
M 50 50
> prop.table(tab) # total percentages
B O
F 0.25 0.25
M 0.25 0.25
> prop.table(tab, 1) # row percentages
B O
F 0.5 0.5
M 0.5 0.5
> prop.table(tab, 2) # column percentages
B O
F 0.5 0.5
M 0.5 0.5
The function by
• The function by is a very nice wrapper of the function tapply when using data frames. It can apply the
intended function to all variables of the data set for each unique level of an indicator variable.
• It is probably the easiest way to get a nice groupwise summary for a data frame. Note however that the function
must work on data frames!
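The call producing the excerpt below is not included in this extract; a call of the following form (the column selection is an assumption) computes group-wise column means, of which the output for the male group is shown:
> by(crabs[, 4:8], crabs$sex, colMeans)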
: M
FL RW CL CW BD
16.63 12.26 33.69 37.19 15.32
with(DATA, function(var.name,...))
For example:
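An illustrative call (not from the original slides) using the crabs data:
> with(crabs, tapply(RW, sp, mean))
This computes the mean rear width RW for each species without having to prefix the variable names with crabs$.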
Statistical models in R
Summary statistics give only a glimpse at the data, and often inference and/or modeling is the actual goal of the
analysis. R provides a lot of statistical tests as well as a lot of modeling functions. Before we can use them, however,
we have to learn something about R's formula definitions to be able to define models in R.
A basic formula in R has the form
y ~ x1 + x2 + x3
where the part left of ~ is the dependent variable and the right part defines the independent variables.
59
Formulae and intercept
The intercept in a model formula is represented by a 1. By default R assumes that an intercept is present, therefore
mentioning the intercept or not makes no difference. If however the intercept should be removed a -1 is needed in
the formula.
These two models are equivalent, both have an intercept:
y ~ x1 + x2 and y ~ x1 + x2 + 1
The same model without intercept must be defined as:
y ~ x1 + x2 - 1
• : is used for interactions like x1:x2.
• * gives main effects plus interactions, like x1 * x2 = x1 + x2 + x1:x2.
• ^ gives factor crossing up to a certain degree, like (x1 + x2 + x3)^2 = x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3.
• - removes terms, like (x1 + x2 + x3)^2 - x2:x3 = x1 + x2 + x3 + x1:x2 + x1:x3.
• y ~ I(x1 - 1) subtracts one unit from x1 before it enters the model and does not remove the intercept. This is therefore
different from y ~ x1 - 1.
• y ~ I(x1^2) squares the variable x1 and has nothing to do with factor crossing.
60
Different graphic systems in R
The different graphic systems in R are:
> demo("graphics")
> library(lattice)
> demo("lattice")
• plot(x,y)
produces a scatter plot if x and y are numeric.
• plot(X)
produces a scatter plot matrix if X is a data frame.
• plot(x) produces a scatter plot of x against its index vector if x is numeric.
• plot(x)
produces a bar plot if x is a factor.
• plot(x,y)
produces a spine plot if x and y are factors.
• pairs(X)
produces a scatter plot matrix if X is a matrix or data frame.
• coplot(x1 ~ x2 | x3)
produces a number of scatterplots of x1 against x2, given values of x3 (in case x3 is a factor it produces a
scatter plot for each factor level)
• matplot(X,Y)
plots the columns of the matrix X against the columns of matrix Y.
• image(x,y,z)
plots a grid of rectangles along the ascending x, y values and fills them with different colours to represent the
values of z.
• contour(x,y,z)
draws a contour plot for z.
• persp(x,y,z)
draws a 3D surface for z.
62
Special statistical high-level plotting functions II
• dotchart(x)
plots a dotchart.
• stripchart(x)
produces a 1D scatterplot.
• boxplot(x) produces a boxplot.
• pie(x)
produces a pie chart.
• curve(expr)
draws the given expression.
• add=TRUE
forces the function to act like a low-level plotting command, “adds” the plot to an already existing one.
• axes=FALSE
suppresses axis, useful when custom axes are added.
• log="x", "y" or "xy"
Logarithmic transformation of x, y or both axes.
• type=
controls the type of the plot, the default is points.
Types of plots
The default for the type= argument is "p" which draws an individual point for each observation. Other options for
this argument are:
63
Low-level plotting functions
Sometimes the results from the high-level plotting commands need to be "improved". This can be done by low-level
plotting commands which add additional information (like extra points, lines, a legend, . . . ) to an existing plot.
There are plenty of low-level plotting commands available.
In the following only a few of them will be introduced.
Note:
Polygons can be added with the function polygon.
Adding text
The function text adds text to a plot at specified coordinates. Usage:
text(x,y,labels,...)
This means that label_i is put at the position (x_i, y_i).
A common application for this is:
plot(x, y, type = "n")
text(x, y, names)
Note:
Also mathematical symbols and formulae can be added as text, then the labels are rather expressions. For details
see help for plotmath.
Adding a legend
The function legend adds a legend to a specified position in the plot.
Usage:
legend(x,y,legend,...)
In order to let R know what is the “connection” to the graph, at least one of the following options has to be specified.
The specification v must have the same length as legend.
64
Customizing axes
In R one can add several axes to a plot. The function to use is axis. You can specify for the axis the side, position,
label, tick and so on.
Usage:
axis(side,...)
The side of the plot is defined this way:
1=below, 2=left, 3=above and 4=right.
This function is mainly used when in the high-level plotting function the argument axes was set to FALSE.
Note:
If one wants ticks at an axis of a 1D plot for every observed value the function rug can be used.
Graphic parameters
Always when a graphic device gets activated, a list of graphical parameters is activated. This list has certain default
settings. Those default settings are often however not satisfying and should be changed.
Changes can be done permanently in order to affect all plotting functions submitted to that device or only for one
plotting function call. With graphical parameters one can change almost every aspect of a graphic. All graphic
parameters have a name. In the following, some of them are introduced.
> par()
gives a list with all graphical parameters and their current settings.
65
> plot(x,y, pch="*")
• lab=c(x,y,n)
x specifies the number of ticks at the x-axis, y at the y-axis, n the length of the tick labels in characters
(including decimal point).
• las=
orientation of axis labels (0=parallel, 1=horizontal, 2=perpendicular).
• mgp=c(d1,d2,d3)
positions of axis components (details see manual).
• tck=
length of the tick marks.
• xaxs=
style of the x-axis (possible settings, “s”, “e”, “i”, “r”, “d”) y-axis analogous.
Figure margins
A single plot in R is called a figure. A figure contains the actual "plotting area" as well as the surrounding margins.
The borderline between margins and plotting area is normally formed by the axes. The margins contain the labels, titles and
so on.
A graph of the plotting area can be seen on the next slide.
There are two arguments to control the margins. The argument mai sets the margins measured in inches, whereas
the argument mar measures them in numbers of text lines. The margins themselves are divided into four parts: the
bottom is part 1, the left part 2, the top part 3 and the right part 4. The different parts are addressed with the corresponding
index of the margin vector.
For instance:
mai=c(1,2,3,4) (1 inch bottom, 2 inches left, 3 inches top, 4 inches right)
mar=c(1,2,3,4) (1 line bottom, 2 lines left, 3 lines top, 4 lines right)
66
Figure regions
Device drivers
R can create graphics for almost all types of displays or printing devices. However, R has to be told before
the figure is made which device should be used - therefore the device driver has to be specified.
help(Devices)
provides a list with all possible devices. The device of interest is activated by calling its name and specifying
the necessary options in the parentheses.
For instance:
> jpeg(file="figure.jpg",
+ width=5, height=4, bg="white")
Figure 2: Taken from the R Introduction manual.
> dev.off()
Plotting example I
The following plot should give an impression of the colours, symbols and point sizes in R.
> plot(1,1,xlim=c(1,10),ylim=c(0,5),type="n")
> points(1:9,rep(4.5,9),cex=1:9,col=1:9,pch=0:8)
> text(1:9,rep(3.5,9),labels=paste(0:9),cex=1:9,col=1:9)
> points(1:9,rep(2,9),pch=9:17)
> text((1:9)+0.25,rep(2,9),paste(9:17))
> points(1:8,rep(1,8),pch=18:25)
> text((1:8)+0.25,rep(1,8),paste(18:25))
68
Plotting example I - the plot
(Figure: the resulting plot showing plotting symbols 0-25, colours and point sizes.)
Plotting example II
This plot is about putting two figures into one window.
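The code of this example is not included in this extract; a sketch that produces two such panels (the data and settings are assumptions consistent with the figure) is:
> x <- rnorm(80)
> par(mfrow = c(1, 2))
> hist(x, freq = FALSE)
> plot(density(x))
> par(mfrow = c(1, 1))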
69
Plotting example II - the plot
(Figure: two panels side by side, a histogram on the density scale and a kernel density estimate; N = 80, Bandwidth = 0.3046.)
70
Plotting example III - the plot
(Figure: Gas plotted against Insul with the levels Before and After.)
Workflow
Before we can perform the statistical analysis, steps are required to bring the data into a decent format and to get
it ready for the analysis:
Note: Steps 2-7 need not be done in order and can be done repeatedly.
Datasets available in R
Base R and a lot of add-on packages have built-in datasets (i.e., data.frame objects) to demonstrate the usage of
functions.
Those datasets can be loaded using the function
> data(foo)
71
This function searches along the search path for a dataset with the corresponding name.
A list of all datasets currently available can be retrieved by submitting only
> data()
• scan Most flexible function, all the following functions are based on this function.
• read.table The probably user friendliest function to read tabular data, this function will be used in this course.
• read.csv Same as read.table but different default values.
• read.csv2 Same as read.table but different default values.
• read.delim Same as read.table but different default values.
• read.delim2 Same as read.table but different default values.
Sometimes read.table unfortunately makes rather strange conversions for the different variables.
In that case the following arguments of read.table are useful:
• as.is Should the function really try to convert the variables to the “right” format?
• colClasses If you know the format of each class in advance, you can also specify them here.
72
data.table
Especially for large data sets, data frames are not very suitable. The package data.table provides its own
infrastructure to deal with data sets of sizes up to, for example, 100GB.
The corresponding function to read in the data is then fread.
In the following we will assume however a data frame.
Data preprocessing
• In the research process, doing the statistical analysis often takes less time than data preprocessing.
• Preprocessing in this context means for example transformations of variables, sorting according to variables,
combining different data frames or splitting data frames.
• R might not be the most convenient tool for data preprocessing (sorry!). But it still offers a lot of tools and
most operations can be made with it.
• This section of the lecture deals with data manipulation for data frames and uses methods that are provided by
the base distribution of R though for example packages like reshape help making some transformations easier.
Variable names
For large data sets it is sometimes useful to see the variable names of a data frame. Or sometimes one even wants to
change those names. There are several ways to do this. One way is the function names.
73
> dataF1 <- data.frame(V1 = 1:3, V2 = rnorm(3),
+ V3 = factor(c(1, 2, 1)))
> names(dataF1) # gets the names
[1] "V1" "V2" "V3"
> names(dataF1) <- c("v1","v2","v3") # overwrites the
> # current names
> row.names(dataF1)
[1] "1" "2" "3"
Note: rownames and colnames are for matrices. row.names and names are for data frames. But both versions can be
used.
74
The function merge II
• The function merge(x, y, ...) performs the operations known in database management systems (e.g., SQL)
as JOIN; a small sketch is given below.
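A small sketch (the data frames df1, df2 and the key column id are assumptions):
> df1 <- data.frame(id = 1:3, x = c("a", "b", "c"))
> df2 <- data.frame(id = 2:4, y = c(10, 20, 30))
> merge(df1, df2)                # inner join on the common column id
> merge(df1, df2, all.x = TRUE)  # left outer join
> merge(df1, df2, all = TRUE)    # full outer join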
75
The function reshape
• A special case of data is longitudinal data (panel data).
• There are two ways the data can be arranged: with the repeated measurements in individual columns (wide
format) or below each other (long format).
• Depending on the analysis you might need one or the other form. The function reshape can change between
them; a sketch is given below.
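A minimal sketch of going from the wide to the long format and back (the example data frame is an assumption):
> wide <- data.frame(id = 1:3, y1 = c(5, 6, 7), y2 = c(8, 9, 10))
> long <- reshape(wide, direction = "long", idvar = "id",
+                 varying = c("y1", "y2"), v.names = "y", timevar = "time")
> reshape(long, direction = "wide", idvar = "id",
+         v.names = "y", timevar = "time")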
76
The function split
If, for example, a separate data.frame is wanted for each level of a factor, the function split can be used. It saves
the result in a list, however, and the individual data frames must be extracted from there.
$b
V1 V2 V3
2 2 -0.2791 b
> library(MASS)
> data("crabs", package = "MASS")
> subset(crabs, RW >= 15.3 & sp == "B",
+ select = FL:BD)
FL RW CL CW BD
44 18.8 15.8 42.1 49.0 17.8
47 19.7 15.3 41.9 48.5 17.8
50 21.3 15.7 47.1 54.6 20.0
97 16.7 16.1 36.6 41.9 15.4
98 17.4 16.9 38.2 44.1 16.6
99 17.5 16.7 38.6 44.5 17.0
100 19.2 16.5 40.9 47.9 18.1
77
• The difference between the two is that the changes made with edit have to be stored in a new dataset, while fix
can overwrite the current dataset.
• Both functions will open a new window where you can edit single cells, change variable names or determine
the type of a variable.
Missing data
• In R, missing values are represented by the symbol NA (not available).
• Often the result of an operation in which NA occurs is also set to NA.
• Many functions and procedures have an argument for handling NAs (na.rm), which if it is set to TRUE excludes
the NA observations from the respective calculation.
• Note: This corresponds to the standard procedure of many statistics programs, but may lead to different
samples in the calculations.
• Most standard models cannot deal with missing values (exceptions: boosting, decision trees. . . ).
• In any case, missing values must be investigated before an analysis can be performed.
• Options:
Missing data in R
• In R, the function is.na() can be used on a vector, matrix or data frame to check which elements are NA:
> data("airquality")
> colMeans(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
0.24183 0.04575 0.00000 0.00000 0.00000 0.00000
• Also, the function complete.cases() returns a logical which is TRUE if the row contains no NAs
78
> # number of complete observations/rows
> sum(complete.cases(airquality))
[1] 111
> ddf <- data.frame(x = c(1, NA, 3), y = c(11, 10, NA))
> ddf
x y
1 1 11
2 NA 10
3 3 NA
> ddf[is.na(ddf)] <- 0
> ddf
x y
1 1 11
2 0 10
3 3 0
> ddf <- data.frame(x = c(1, NA, 3), y = c(11, 10, NA))
> ddf$x[is.na(ddf$x)] <- mean(ddf$x, na.rm = TRUE)
> ddf$y[is.na(ddf$y)] <- mean(ddf$y, na.rm = TRUE)
> ddf
x y
1 1 11.0
2 2 10.0
3 3 10.5
79
Question: Would you use the mean or the median for imputation for the airquality data? How could you decide?
– univariate: observations that lie more than 1.5·IQR below the 25th or above the 75th percentile (the IQR,
"inter-quartile range", is the difference between the 75th and 25th percentiles); in a boxplot these can be
visualized as the points outside the whiskers.
– multivariate
∗ defined within the scope of a model (e.g., based on Cook’s distance, which we will encounter in the
regression chapter).
∗ observations which are anomalous based on all the variables under investigation (detected using
unsupervised learning algorithms for anomaly detection)
Outlier handling
• Elimination (not advised!)
• Imputation - same as missing values
• Capping - e.g., setting all values above (below) a certain quantile to the value of a quantile.
• Use methods in the statistical analysis which are robust to this issue.
[Plot: boxplot of airquality$Ozone; observations 62 and 117 lie above the upper whisker]
80
> bxp <- boxplot(airquality$Ozone)     # bxp$out contains the outlying values
> airquality$Ozone[which(airquality$Ozone %in% bxp$out)] <-
+ quantile(airquality$Ozone, 0.95, na.rm = TRUE)
Further R topics
> search()
[1] ".GlobalEnv" "whiteside" "package:MASS"
[4] "package:stats" "package:graphics" "package:grDevices"
[7] "package:utils" "package:datasets" "package:methods"
[10] "Autoloads" "package:base"
Packages for R
• R is open source software and users submit new functions all the time. These functions are normally submitted
as packages.
• The base version of R, however, comes only with a few selected packages. Other packages must first be installed;
the easiest way is to use the menu for this (see also the example below).
• Even though packages are installed, they are not yet available at the beginning of an R session (besides a few
basic packages which are loaded automatically). Add-on packages should be loaded by the user when they are
needed.
• Whether a package is loaded can be seen in the search path.
• Packages can be loaded using the menu or as
> library(foo)
• Sometimes it is also necessary to remove packages from the search path. This can be done by submitting
> detach("package:foo")
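Installation and updating can also be done from the console, for example (foo is a placeholder for a package name):
> install.packages("foo")   # install foo and its dependencies from CRAN
> update.packages()         # update all installed packages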
81
Citing R
• R comes for free and a lot of people contribute to it. They don’t want any money from you when you use it;
they would, however, like to be acknowledged when you use their work.
• Therefore it is appreciated if you cite R and specific packages when you use them for your work. If you want
to know how R or a package should be cited, use the function citation.
• For R in general:
> citation()
@Manual{,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2021},
url = {https://www.R-project.org/},
}
• For R packages:
> citation("MASS") # for citing packages, in this case the package MASS
@Book{,
title = {Modern Applied Statistics with S},
author = {W. N. Venables and B. D. Ripley},
publisher = {Springer},
edition = {Fourth},
address = {New York},
year = {2002},
note = {ISBN 0-387-95457-0},
url = {https://www.stats.ox.ac.uk/pub/MASS4/},
}
82
• Unless otherwise specified in the global settings, R will ask before it is closed whether the current workspace should
be saved. In that case it will load the saved workspace at the start of the next session.
• Saving the whole workspace is typically not recommended. See e.g., the discussion here.
• Objects saved in previous sessions can be loaded into a new session using load; see the example below.
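For example (using the data frame dataF1 created earlier; the file name is an assumption):
> save(dataF1, file = "dataF1.RData")  # save selected objects to a file
> load("dataF1.RData")                 # restore them in a later session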
Working directory
• The working directory is the path where R will search by default for files to read or where R will by default
save files.
• The current working directory can be obtained or changed using functions getwd() and setwd() or the menu.
> getwd()
[1] "/Users/lauravanagur/Documents/Teaching/CompStat"
> # try, but does not work in Rmarkdown
> # setwd("/Users/lauravanagur/Documents/")
> # getwd()
• When opening a file with RStudio, it automatically sets the working directory to the location of the file. In
Rmarkdown it is automatically the location of the .Rmd.
> ## getwd()
> ## "/Users/lauravanagur/Documents/Teaching/CompStat/Slides"
> dat <- read.csv("Practicals/Datasets/dat.csv")
Scripts
• Scripts written in editors are usually saved in files with ending .r or .R.
• These files can be loaded from within R.
• To load a whole script the function source is used.
83
> ## again, with relative path...
> source("Rscript.R")
• The source command will by default create all objects that are defined in the file but produces no output.
Output will only be produced if an object in the file is explicitly printed using the print function.
– Standard calendar is complicated (leap years, months of different lengths, historically different calendars
- Julian vs. Gregorian).
– Times depend on an unstated time zone (add daylight savings :-() and some years have leap seconds to
keep the clocks consistent with the rotation of the earth!
• R can flexibly handle dates and times and has different classes for them with different complexity levels.
• Most classes offer then also arithmetic functions and other tools to work with date and time objects.
• A good overview over the different classes is given in the Helpdesk section of the R News 4(1).
• The builtin as.Date() function handles dates (without times).
• The contributed library chron handles dates and times, but does not control for time zones.
• The POSIXct and POSIXlt classes allow for dates and times with control for time zones.
• The various as. functions can be used for converting strings or among the different date types when necessary.
> Sys.Date()
[1] "2022-01-20"
as.Date() function
• The as.Date() function allows for a variety of formats through the format= argument.
84
Code Value
%d Day of the month (decimal number)
%m Month (decimal number)
%b Month (abbreviated)
%B Month (full name)
%y Year (2 digit)
%Y Year (4 digit)
%C Century
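For example:
> as.Date("20.01.2022", format = "%d.%m.%Y")
[1] "2022-01-20"
> as.Date("01/20/22", format = "%m/%d/%y")
[1] "2022-01-20"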
85
• DDMMYY gives the date of birth.
• C specifies the century of birth. + = 19th Cent., - = 20th Cent. and A = 21st Cent.
• ZZZ is the personal identification number. It is even for females and odd for males
• Q is a control number or letter to see if the total number is correct.
Debugging
There are two commonly cited claims:
1. Programmers spend more time debugging their own code than actually programming it.
2. In every 20 lines of code there is at least one bug.
Hence debugging is an essential part of programming, and R offers strategies and tools to do this well.
Top-down programming
• The general agreement is that good code is written in a modular manner. This means that when you have a
procedure to implement, you decompose it into small parts, where each part becomes its own function.
• Then the main function is “short” and consists mainly of calls to these subfunctions.
• Naturally the same approach is also taken within these subfunctions.
• The same approach is followed in debugging. First the top-level function is debugged and all subfunctions are
assumed correct. If this does not yield a solution, then the next level is debugged, and so on.
86
Small start strategy
• The small start strategy in debugging suggests starting the debugging with small test cases.
• Once these work fine, consider larger test cases.
• At that stage also extreme cases should be tested.
Antibugging
• Some antibugging strategies are also useful in this context.
• Assume that at line n in your code you know that a variable or vector x must have some specific property, like
being positive or summing up to 1.
• Then, for debugging purposes, you can add in that line of the code for example
> stopifnot(all(x > 0))
or
> stopifnot(sum(x) == 1)
• browser
• debug and undebug
• debugger
• dump.frames
• recover
• trace and untrace
For details about these functions see their help pages. In the following we will look only at debug and traceback.
Note that RStudio also offers special debugging tools; see
https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio for details.
traceback
• Often when using functions and an error occurs, it is not really clear where the error actually occurs, i.e., which
(sub)function caused the error.
• One strategy is then to use the traceback function which, when called directly after the erroneous call, returns
the sequence of function calls which led to the error.
traceback II
87
> f1 <- function(x) f2(x)^2
> f2 <- function(x) log(x) + "x"
> mainf <- function(x) {
+ x <- f1(x)
+ y <- mean(x)
+ y
+ }
> mainf(1:3)
> traceback()
debug
Assume you have a function foo you assume faulty. Using then
> debug(foo)
will open the “browser” whenever the function is called, until either the function is changed or the debugging mode
is terminated using
> undebug(foo)
In the “browser” the function is executed line by line, and the next line to be executed is always shown.
• n (or just hitting enter) will execute the line shown and then present the next line to be executed.
• c this is almost like n, except that it may execute several lines of code at once. For example, if you are in a loop,
then c will jump to the next iteration of the loop.
• where this prints a stack trace, i.e., the sequence of function calls which led the execution to the current location.
• Q this quits the browser.
In browser mode any other R command can be used as well. However, to see for example the value of a variable
n, the variable needs to be explicitly printed using print(n).
Debugging demo
In a demo we will go through the following function in debugging mode
88
+ }
+ print(paste(i,Sys.time()))
+ }
+ return(RES)
+ }
> debug(SimuMeans)
> SimuMeans(5)
Capturing errors
• Especially in simulations it is often desired that, when an error occurs, not the whole process is terminated;
instead the error is caught and an appropriate record made, but otherwise the simulation continues.
• R has for this purpose the functions try and tryCatch; here we will consider only tryCatch.
• The idea of tryCatch is to run the “risky” part, where errors might occur, within the tryCatch call and tell
tryCatch what to return in the case of an error.
89
+ }
> SimuMeans3(5)
[,1] [,2] [,3]
[1,] 0.10889 -0.29099 1.1103
[2,] -0.04921 -0.17200 0.8624
[3,] NA -0.02305 1.0302
[4,] -0.09209 -0.27303 1.0814
[5,] -0.05374 0.13526 1.0200
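The definition of SimuMeans3 is not fully shown above; purely as a hypothetical sketch (all names, dimensions and the artificial error are assumptions), a simulation function using tryCatch could look like this:
> SimuMeans3 <- function(n) {
+   RES <- matrix(NA, nrow = n, ncol = 3)
+   for (i in 1:n) {
+     RES[i, ] <- tryCatch({
+       x <- rnorm(100)
+       if (runif(1) < 0.1) stop("simulated failure")   # occasional error
+       c(mean(x), median(x), var(x))
+     },
+     error = function(e) rep(NA, 3))    # record NAs instead of stopping
+   }
+   RES
+ }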
Profiling
• If you know that your function is correct but think it is slow, you can do profiling, which helps to identify the
parts of the function which are bottlenecks; you can then consider whether these parts could be improved.
• The idea in profiling is that the software checks at very short intervals which function is currently running.
• The main functions in R to do profiling are Rprof and summaryRprof. But there are also many other specialized
packages for this purpose.
A function to profile
A function to profile II
Run it on your own computer and look at the full output of summaryRprof().
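The function that was profiled is not shown here; a generic sketch of the profiling workflow (the profiled code is only an example) is:
> Rprof("profile.out")            # start profiling, write samples to a file
> for (i in 1:50) m <- apply(matrix(rnorm(1e4), ncol = 10), 2, mean)
> Rprof(NULL)                     # stop profiling
> summaryRprof("profile.out")     # summarize where the time was spent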
Package microbenchmark
• The contributed package microbenchmark is useful in comparing the speed of different functions.
90
• The microbenchmark() function serves as a more accurate replacement of the often seen system.time().
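For example, comparing a specialized function with apply (a sketch):
> library(microbenchmark)
> X <- matrix(rnorm(1e5), ncol = 10)
> microbenchmark(colMeans(X), apply(X, 2, mean))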
Regression modeling in R
The following chapter gives a small glimpse of the linear regression model in R.
There are many options (functions) in R available for other regression models (e.g., generalized linear models, pe-
nalized regression models etc.). We focus here only on basic linear regression, but many principles also apply when
using functions for other regression models.
Here are some useful functions and packages for regression in R:
aov ANOVA models in R
lm linear regression
glm generalized linear models like logistic regression
nls nonlinear regression
nlme package for linear and nonlinear mixed effect models
lme4 package for linear and generalized linear mixed effect models
survival package for parametric and nonparametric survival models
The object returned by a regression function is usually quite complex and printing it returns only minimal output.
A lot of generic functions, however, have methods for the different regression models. Some important ones are:
91
Linear model
The linear model assumes that the relationship between the response variable (aka dependent variable, output) Y
and p independent variables (aka explanatory variables, predictors, covariates, features) X1 , . . . , Xp is linear and can
be represented as:
Y = β0 + β1 X1 + . . . + βp Xp + ϵ,
where β0 is the model constant or intercept, βj is the regression coefficient corresponding to the variable Xj and ϵ is
a random error term which captures variation in Y not explained by X1 , . . . , Xp .
The model is linear in the unknown parameters βj , j = 0, . . . , p.
The variables X1 , . . . , Xp can come from different sources:
• quantitative inputs,
• transformations of quantitative inputs such as the log, square root, square,
• basis expansions e.g., X2 = X12 , X3 = X13 . . .
• numeric or dummy coding of the levels of qualitative inputs,
• interactions between variables: X3 = X1 · X2 .
In matrix notation the model is y = Xβ + ϵ, where
• y = (y1 , . . . , yn ),
• X = (1, x1 , . . . , xp ) is an n × (p + 1) matrix of independent variables (including a vector of ones corresponding
to the intercept),
• β = (β0 , β1 , . . . , βp ) is the (p + 1) × 1 vector of regression coefficients (with intercept) and
• ϵ = (ϵ1 , . . . , ϵn ).
92
Conventions
Usually the following terms are used in a regression context:
Leverage
The leverages $h_i$ are useful in identifying influential observations. We know that (for a model with intercept):
• $\sum_{i=1}^{n} h_i = \mathrm{ncol}(X)$, where for a model with intercept $\mathrm{ncol}(X) = p + 1$,
• in a model with intercept, $h_i \geq \frac{1}{n}$.
$\mathrm{var}(\hat\beta) = (X^\top X)^{-1}\sigma^2$.
$\hat\beta \sim N\big(\beta,\, (X^\top X)^{-1}\sigma^2\big)$
• Also, $(n - p - 1)\,\hat\sigma^2 \sim \sigma^2 \chi^2_{n-p-1}$
• $t = \frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}} \sim t_{n-p-1}$, where $v_j$ is the $j$th diagonal element of $(X^\top X)^{-1}$
• To check the significance of groups of coefficients simultaneously we can use the F -statistic which has an F
distribution under the null:
$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(n - p_1 - 1)} \sim F_{p_1 - p_0,\; n - p_1 - 1}$
where RSS0 is the RSS of the smaller model with p0 variables and RSS1 is the RSS of the larger model with
p1 variables.
93
Goodness of fit
• Before discussing the residual analysis, we recall a few quantities which quantify the extent to which the model
fits the data.
– residual standard error σ̂, standard deviation of the residuals which is an estimate of standard deviation
of ϵ.
∗ Roughly speaking, it is the average amount that the response will deviate from the true regression
line.
∗ It is measured in the units of the response.
– R2 statistic (coefficient of determination), which represents the proportion of variability in Y that can be
explained by the linear regression. Note that it always increases as we add more predictors to the model.
$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar y)^2$
• Note that these measures only apply to the linear regression case and do not easily extend to other types of
regression.
Residuals
The realizations of the random term $\epsilon_i$ are not observable. Therefore we use the residuals $r_i$ as estimates instead.
Residuals are useful to evaluate the goodness of fit of the model and to check the model assumptions, but they have
some design limitations since they must fulfill
$\sum_{i=1}^{n} r_i = 0 \quad \text{and} \quad X^\top r = 0$
Furthermore, residuals do not have the same variance by construction. Their variance decreases as the x values move
further away from the average x value:
var(ri ) = σ 2 (1 − hi )
There are two ways to standardize residuals to make them more useful for model diagnostics.
Standardized residuals
The standardized residuals are rescaled to have equal variances. They are computed using the leverages.
$\tilde r_i = \frac{r_i}{\hat\sigma \sqrt{1 - h_i}}$
Studentized Residuals
A way to get “good” residuals when there is one bad data point is to look at what would happen if we dropped one
observation and used only the remaining n − 1 observations for the estimation. With that fit we predict the value of
the omitted observation and obtain the so-called studentized residuals:
$\check r_i = \frac{y_i - \hat y_{(i)}}{\sqrt{\mathrm{var}(y_i - \hat y_{(i)})}},$
94
where yi is the omitted observation and ŷ(i) the prediction of yi based on a model that was fitted after excluding the
ith observation.
Note: The terminology for residuals is not everywhere the same, therefore check always carefully which definition
your software package uses.
Residual analysis
The following plots can be useful for evaluating the model assumptions:
Note that model assumptions are usually checked visually rather than by testing.
qqplot of residuals
• The qqplot allows us to check the assumption of normality. It is recommended to use the standardized or the
studentized residuals for this purpose.
• The points should then lie on the bisector.
95
Outliers, leverage and influential points
[Plot: four scatterplots of y vs. x — original data; outlier with high leverage, low residual and no influence; point with high leverage, large residual and high influence; outlier with low leverage, large residual and low influence]
[Plot: scatterplots of the four Anscombe data sets, y1 vs. x1 through y4 vs. x4]
96
> data("anscombe")
> colMeans(anscombe)
x1 x2 x3 x4 y1 y2 y3 y4
9.000 9.000 9.000 9.000 7.501 7.501 7.500 7.501
> apply(anscombe, 2, sd)
x1 x2 x3 x4 y1 y2 y3 y4
3.317 3.317 3.317 3.317 2.032 2.032 2.030 2.031
[Plot: residuals vs. fitted values for the four Anscombe regressions (Anscombe 1-4)]
97
Anscombe quartet: qqplots of standardized residuals
[Plot: normal qqplots of the standardized residuals for the four Anscombe regressions]
Design matrix
• As shown earlier, we assume we have a data matrix X which contains the explanatory variables.
• However the data matrix containing the variables X1 , . . . , Xp is usually not the matrix which we use in the
formulas earlier, but here X denotes the model or design matrix based upon the explanatory variables.
• For example the model matrix has usually a column of 1’s to model an intercept term.
• In the following slides we will discuss the forms explanatory variables can take when entering the model matrix.
It is important, however, that the design matrix always has full rank.
• Example: when centering the predictors, the intercept can be interpreted as the expected value of Y for average
values of the original predictors. This can be useful in some applications, such as predicting house prices using
m² and the number of bedrooms.
98
Continuous variables and collinearity
• A continuous variable need not necessarily enter the model linearly. We can use transformations or add it
as a polynomial of higher order to the model.
• Adding polynomials should however be done with care because the polynomial terms correlate with each other
and can cause problems when estimating the parameters (collinearity → X⊤X gets close to being singular).
• If the predictors show large amounts of correlation, either pairwise elimination can be employed or a principal
component analysis could be made and the principal components used instead of the actual variables.
• Ideally, for the ceteris paribus interpretation to hold, the predictors should be independent. This is rarely the
case in practice. If the predictors are independent, then the coefficients of the individual simple linear regressions
are the same as the ones from the multiple linear regression.
– treatment contrast
– sum contrasts
– helmert contrast
– polynomial contrast
Treatment contrast I
• The treatment contrast is one of the most frequently used contrasts. The contrast has L − 1 columns.
• Assume we would have a factor with L = 4 levels, then the three columns would look like shown in the table.
• Assume x is the original categorical variable in the design matrix, this implies that we create l = 3 columns:
$d_{ij} = \begin{cases} 1 & \text{if } x_i = j + 1 \\ 0 & \text{otherwise} \end{cases}, \qquad j = 1, \ldots, L - 1$
Treatment contrast II
• The regression model would then be $y_i = \beta_0 + \beta_1 d_{i1} + \beta_2 d_{i2} + \beta_3 d_{i3} + \ldots + \epsilon_i$
• The interpretation for the coefficients of the dummies of levels 2-4 would then be the difference in the expected
response with respect to level 1 (assuming all other variables are 0).
• β0 + βj gives the expected response for group j.
• The effect of the first level could then be associated with the intercept β0 .
99
Sum contrast I
• The sum contrast is a popular contrast for balanced experimental designs.
• All columns in the contrast have to add up to 0.
Level [,1] [,2] [,3]
1 1 0 0
2 0 1 0
3 0 0 1
4 −1 −1 −1
• Assume x is the original categorical variable in the design matrix, this implies that we create l = 3 columns:
$d_{ij} = \begin{cases} 1 & \text{if } x_i = j \\ -1 & \text{if } x_i = L \\ 0 & \text{otherwise} \end{cases}, \qquad j = 1, \ldots, L - 1$
Sum contrast II
• The regression model would be
$y_i = \beta_0 + \beta_1 d_{i1} + \beta_2 d_{i2} + \beta_3 d_{i3} + \ldots + \epsilon_i$
• The interpretation for the coefficients of the dummies would then be the difference in the expected response
for level or group j with respect to the overall mean (assuming all other variables are 0).
• The intercept β0 has the interpretation of the overall expected value of the response when the predictors are
set to zero.
• $\beta_0 + \beta_j$ gives the expected response for group $j$ for $j = 1, \ldots, L - 1$, and for the $L$th group the expected response is
$\beta_0 - \sum_{j=1}^{L-1} \beta_j$.
Helmert contrast
• The helmert contrast is a popular contrast (for instance default in S-Plus).
• The first coefficient is the mean of the first two effects minus the first effect; the second coefficient is the mean
of all three effects minus the mean of the first two levels (parameter j compares the mean of the effects for levels
1:(j + 1) with the mean of the effects for the preceding levels 1:j).
• It turns out the intercept is the mean of the means.
Polynomial contrast
• The polynomial contrast is recommended for ordered equidistant factors.
• It envisages the levels of the factor as corresponding to equally spaced values of an underlying continuous
covariate.
• It forces the effects to be monotonic in factor level order.
• It is however not that easy to interpret
100
Interactions
• A basic model assumption is that the different variables have an additive effect on the response.
• However, this is not always the case and one way to include non-additive effects in linear models is by using
interactions.
• Usually only interactions between two variables at a time are considered. The interaction terms enter the
design matrix as products of the columns of the two variables concerned.
• The interpretation of the interactions depends on the variable types of the variables involved.
Interpreting interactions
Interactions between 2 factors: This is the simplest case. Here one has basically different levels for all possible
combinations of the levels of the original factors.
Interactions between factor and numeric variable: In this case the numeric variable has still a linear effect
but now for each factor level there is a different slope.
Interactions between 2 numeric variables: This is a bit difficult to interpret. Basically, if one variable is kept
fixed, then the effect of the other variable is linear, where the slope depends on the value at which the first variable
is kept fixed.
Model selection
When fitting a regression model the aim is normally to find the smallest set of predictors which still describe the
data adequately well. Several strategies are available for model selection. Often different methods lead to different
models!
But what should always be considered:
There are no routine statistical questions, only questionable statistical routines. (D.R. Cox)
Backward selection
This method is rather simple and starts with all predictors in the model. Then we choose a “p-to-remove” level α.
Here α does not necessarily have to be 0.05; often a larger α like 0.1 or 0.15 is chosen.
The method works the following way:
The final model has all predictors with a p-value smaller than α.
Forward selection
The forward selection method is just the opposite of the backward selection. It starts with an empty model and adds
predictors to the model as long as one of the remaining predictors has a p-value smaller than the “p-to-add” level α.
Again α is rather 0.1 or 0.15 than 0.05.
The method works the following way:
The method works the following way:
101
• fit all models with a single predictor and choose the one where the predictor has the smallest p-value, provided
it is smaller than α;
• fit all the models with the chosen predictor and one of the remaining predictors; keep again the one which has
the smallest p-value smaller than α;
• continue until no predictor can be added anymore.
Stepwise selection
• The stepwise selection is a combination of the backward and the forward selection methods.
• It starts with the backward selection. But after each deletion of a predictor from the model we check, using
the forward method, whether one of the previously deleted predictors could be added to the model again (the
one deleted in the last step cannot be added back immediately).
• After adding or not adding one, we continue with the backward selection until no variable can be added or
removed anymore.
The selection methods described above are easy to implement but have some drawbacks:
• because of the one-at-a-time scheme, the optimal model can be missed;
• there is a multiple comparison problem; especially when prediction is of interest, the stepwise procedure tends
to choose models that are too “small”;
• one should still think about whether one of the excluded variables has a causal relationship with the response
and should therefore remain in the model.
• In general one can say that models with more parameters will fit the data better.
• Therefore criteria are available which “punish” the number of predictors added. Let’s assume we have p predic-
tors in the model.
Deviance = −2 · log-likelihood
AIC = Deviance + 2p
(correspondingly, BIC = Deviance + log(n) · p)
• When we now compare models, we prefer models with a smaller information criterion (i.e., a higher log-likelihood
for the same number of parameters).
• These information criteria can also be used to substitute the p-values in the model selection methods. This
avoids for instance the multiple comparison problem.
• These criteria can also be compared when different distributional assumptions are made as long as they are
based on the same number of observations.
• Note: In the linear regression case, one can also use the adjusted coefficient of determination $R^2_A$ to compare
models:
$R^2_A = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}.$
102
Linear regression in R
The lm function
The function lm is the function for the basic linear model. Its usage is
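A shortened form of the call (see ?lm for the full argument list):
lm(formula, data, subset, weights, na.action, ...)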
If we have assigned an lm function call to an object, we can directly extract many results from it using indexing,
e.g., coefficients, residuals, fitted.values, rank, weights, df.residual, call, terms, contrasts, xlevels, y.
But often the same with more options can be obtained using generic functions.
lm regression objects
• Assume we fitted with an appropriate model formula a regression model using the function lm and assigned
that to the object lm.out.
• Then a lot of functions have a generic output when applied to this object. What exactly these functions are
doing can be explored using the help pages.
• If we are for example interested to know what summary does to an lm object, we can ask the help for this by
using ?summary.lm.
• In general, for any generic function the specific help can be obtained this way.
• If we just ask for the lm.out object we get only minimal output. That is the model formula and the estimated
parameters.
Function update
• After creating a regression object one often wants to make only a small change, like changing the contrast or
removing or adding a variable.
• One could of course then just call the regression function again and make the changes there, but one could also
use the function update. This function applies to the old object the change which we defined in the update
function.
• Using for example +/- we could add or remove independent variables to / from the model.
• Assume lm.out contains the independent variables x1 and x2.
> ## add x3
> lm.out.add <- update(lm.out, . ~ . + x3)
> # eliminate x2
> lm.out.minus <- update(lm.out, . ~ . - x2)
103
anova for one object
• In the case that we have only one lm object, the function anova returns an ANOVA table.
• This is however a sequential analysis of variance table for that fit.
– That is, the function returns a table which shows the reductions in the residual sum of squares as each
term of the formula is added in turn to the model, plus the residual sum of squares.
– The significance of this change is evaluated with an F-test.
– We start reading this table at the top.
• This means that the table says nothing about whether a variable belongs to the model; it only makes a statement
about whether the variable improved the fit when it was added to the model.
• The order in which the model is specified matters here.
• For instance, for the model formula y ~ x + z + w the ANOVA table would look different than if you had used
y ~ w + z + x.
• For the first model, the last row of the ANOVA table evaluates whether a model with x, z and w is equal to a
model with only x and z. The row above then compares the model with x and z against the model with only x.
• We call models nested when there is a “largest” model and all other models can be seen as subsets of this
“largest” model.
• If we now submit several nested lm objects to the anova function, the ANOVA table compares the different
models.
• R however cannot make sure that the models are nested; it just makes this assumption. It is a kind of convention
to start the list with the largest model and arrange the others in descending order.
• Then again we can start our comparison in the last row and compare the results sequentially.
na.action
• Model comparisons based on likelihood tests make the assumption that the design matrix is always the “same”.
• This must be taken into account when the data has missing values.
• Normally, when there are missing values, we delete observations which have missing values in the independent
variables that are used in the current model.
• Therefore often smaller models have more observations than larger models.
• In R we can choose in lm between at least two different na.actions:
– na.omit uses all observations that are possible (no missing values for residuals and fitted values and so
on)
– na.exclude also makes residuals and fitted values comparable when missing values are at hand.
plot
As mentioned earlier, most of the model assumptions of regressions can be evaluated using plots.
R provides by default four plots for diagnostics when an lm object is submitted to the plot function. Those plots
are:
It is often easier to evaluate the fit when plotting all four plots into one window using the par() function; see the
sketch below.
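For example (lm.out is a placeholder for a fitted lm object):
> par(mfrow = c(2, 2))   # 2 x 2 layout for the four diagnostic plots
> plot(lm.out)
> par(mfrow = c(1, 1))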
Other plots can be obtained using the which argument. For details see ?plot.lm.
104
model.matrix
• If one is interested how the design matrix looks one can use the function model.matrix.
• This function returns for an lm object the design matrix where one for example can see which contrast was
used for a factor and so on.
• Especially when there are factors in your model it might be a good idea to check this matrix so that you know
how to interpret the result.
Contrasts in R
As mentioned earlier, factors need dummy variables when they enter a regression model. Depending on that coding,
the interpretation of the parameter estimates changes. Which types of contrasts R uses by default can be found out
using the command:
> getOption("contrasts")
unordered ordered
"contr.treatment" "contr.poly"
There one can see what R uses as default contrasts for unordered factors and ordered factors.
The contrasts discussed earlier have in R the following names:
To specify the characteristics of each contrast like which is the default comparison level in the treatment contrast see
the help for the contrast of interest.
Recall here also the function relevel.
If one wants different contrasts than the default ones, there are two ways to change this. First, we can change it
globally so that it affects all applications where contrasts are needed. For this we use the options command and
specify there the default contrasts for unordered and ordered factors. E.g.:
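For instance, to use sum contrasts for unordered factors globally (a sketch):
> options(contrasts = c("contr.sum", "contr.poly"))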
Or we change it only in our regression function call. Here we can even use several different contrasts. If we call for
example the regression function lm and we have two factors, named factor1 (with treatment contrast) and factor2
(Helmert contrast), we could use:
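A sketch of such a call (the response y and the data set dat are placeholders):
> lm(y ~ factor1 + factor2, data = dat,
+    contrasts = list(factor1 = "contr.treatment",
+                     factor2 = "contr.helmert"))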
Fitted values
• There is a generic function to extract fitted values from a regression object. That function is called fitted.
• However especially for lm objects there are also two other ways to extract fitted values. Let us call our lm
object again lm.out. Then we can get the fitted values using:
– fitted(lm.out)
– fitted.values(lm.out)
– lm.out$fitted
105
Residuals in R
> lm.out$res
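Residuals can also be extracted with the generic function residuals; the standardized and studentized residuals discussed earlier are available via rstandard and rstudent (lm.out is again a placeholder for a fitted lm object):
> residuals(lm.out)
> rstandard(lm.out)   # standardized residuals
> rstudent(lm.out)    # studentized residuals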
Predictions in R I
There can be several reasons to fit a regression model. One reason is to predict the dependent variable for new
subjects or to predict the development in the future.
It is quite easy to get predictions in R; mainly two steps are needed. First one has to create a data frame
(data.new) that contains the settings of the independent variables for which a prediction is wanted. Then one uses
the function predict to obtain the predictions.
Assume one wants to predict for the lm.out object and one has a data frame data.new for which one wants to predict.
Then use predict(lm.out, newdata = data.new), as shown in the sketch below.
When we are also interested in intervals we can add the interval argument: we can request either the real prediction
interval (which also takes the variation of the errors into account, interval = "prediction") or the confidence
interval, i.e., the interval for the expected value of the response (interval = "confidence").
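In code (a sketch using the placeholders lm.out and data.new):
> predict(lm.out, newdata = data.new)
> predict(lm.out, newdata = data.new, interval = "prediction")
> predict(lm.out, newdata = data.new, interval = "confidence")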
Influence diagnostics in R
• For a regression object lm.out, the function influence.measures(lm.out) will return a data frame containing
all important influence measures such as:
– DFBETAS: measures the difference in each parameter estimate with and without the influential point.
– DFFITS: scaled difference between the ith fitted value ŷi obtained from the full data and the ith predicted
value ŷ(i) obtained by deleting the ith observation.
– Cook’s distance: $D_i = \frac{r_i^2\, h_i}{(p+1)\,\sigma^2\,(1-h_i)^2}$
– covariance ratios: $\det\big(\hat\sigma^2_{(i)} (X_{(i)}^\top X_{(i)})^{-1}\big) \big/ \det\big(\hat\sigma^2 (X^\top X)^{-1}\big)$
– leverage values for each observation (column hat).
• Observations assumed to be influential concerning any of the diagnostics are marked with an asterisk.
106
Model selection in R
• Automatic model selection is also possible in R, however not based on p-values but on AIC or BIC. The
function for this is step; see the sketch below.
• It can perform all three different types of selection: backward, forward and stepwise.
• One can even specify minimal and maximal models between which we want to choose. In general one can
penalize the number of parameters with any weight k, but only the settings k = 2 (AIC) or k = log(n)
(BIC) have a theoretical foundation.
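A sketch of typical calls (lm.out is a placeholder for a fitted model):
> step(lm.out)                          # selection based on AIC (k = 2)
> step(lm.out, k = log(nobs(lm.out)))   # selection based on BIC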
Examples
Cherry Tree Example I
As a first example consider the trees data set (shipped with R in the datasets package). The data set contains the
girth, height and volume of 31 felled black cherry trees. The aim is to obtain a model which can be used to predict
the volume of a tree based on its height and girth.
> str(trees)
'data.frame': 31 obs. of 3 variables:
$ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
$ Height: num 70 65 63 72 81 83 66 75 80 75 ...
$ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
> summary(trees)
Girth Height Volume
Min. : 8.3 Min. :63 Min. :10.2
1st Qu.:11.1 1st Qu.:72 1st Qu.:19.4
Median :12.9 Median :76 Median :24.2
Mean :13.2 Mean :76 Mean :30.2
3rd Qu.:15.2 3rd Qu.:80 3rd Qu.:37.3
Max. :20.6 Max. :87 Max. :77.0
> plot(trees)
107
[Plot: scatterplot matrix (pairs plot) of Girth, Height and Volume]
Let us first fit a marginal model for each of the two explanatory variables.
> options(show.signif.stars=FALSE)
> fit.girth <- lm(Volume ~ Girth, data = trees)
> summary(fit.girth)
Call:
lm(formula = Volume ~ Girth, data = trees)
Residuals:
Min 1Q Median 3Q Max
-8.065 -3.107 0.152 3.495 9.587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.943 3.365 -11.0 7.6e-12
Girth 5.066 0.247 20.5 < 2e-16
Call:
108
lm(formula = Volume ~ Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-21.27 -9.89 -2.89 12.07 29.85
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.124 29.273 -2.98 0.00583
Height 1.543 0.384 4.02 0.00038
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.406 -2.649 -0.288 2.200 8.485
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.988 8.638 -6.71 2.7e-07
Girth 4.708 0.264 17.82 < 2e-16
Height 0.339 0.130 2.61 0.014
> coef(fit.both)
(Intercept) Girth Height
-57.9877 4.7082 0.3393
> confint(fit.both)
2.5 % 97.5 %
(Intercept) -75.68226 -40.2931
Girth 4.16684 5.2495
Height 0.07265 0.6059
109
> fit.full <- lm(Volume ~ Girth + I(Girth^2) + Height + I(Height^2),
+ data = trees)
> summary(fit.full)
Call:
lm(formula = Volume ~ Girth + I(Girthˆ2) + Height + I(Heightˆ2),
data = trees)
Residuals:
Min 1Q Median 3Q Max
-4.368 -1.670 -0.158 1.792 4.358
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.95510 63.01363 -0.02 0.988
Girth -2.79657 1.46868 -1.90 0.068
I(Girthˆ2) 0.26545 0.05169 5.14 2.4e-05
Height 0.11937 1.78459 0.07 0.947
I(Heightˆ2) 0.00172 0.01190 0.14 0.886
> coef(fit.full)
(Intercept) Girth I(Girthˆ2) Height I(Heightˆ2)
-0.955101 -2.796569 0.265446 0.119372 0.001717
> confint(fit.full)
2.5 % 97.5 %
(Intercept) -130.48147 128.57127
Girth -5.81548 0.22234
I(Girthˆ2) 0.15920 0.37169
Height -3.54890 3.78765
I(Heightˆ2) -0.02275 0.02619
> anova(fit.full)
Analysis of Variance Table
Response: Volume
Df Sum Sq Mean Sq F value Pr(>F)
Girth 1 7582 7582 1060.60 < 2e-16
I(Girthˆ2) 1 213 213 29.78 1e-05
Height 1 125 125 17.54 0.00029
I(Heightˆ2) 1 0 0 0.02 0.88645
Residuals 26 186 7
110
> with(trees, cor(Girth, Girth^2))
[1] 0.993
> with(trees, cor(Height, Height^2))
[1] 0.9989
Call:
lm(formula = Volume ~ I(Girth - m.Girth) + I((Girth - m.Girth)ˆ2) +
I(Height - m.Height) + I((Height - m.Height)ˆ2), data = trees)
Residuals:
Min 1Q Median 3Q Max
-4.368 -1.670 -0.158 1.792 4.358
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.57375 0.70403 39.17 < 2e-16
I(Girth - m.Girth) 4.23689 0.20222 20.95 < 2e-16
I((Girth - m.Girth)ˆ2) 0.26545 0.05169 5.14 2.4e-05
I(Height - m.Height) 0.38031 0.09390 4.05 0.00041
I((Height - m.Height)ˆ2) 0.00172 0.01190 0.14 0.88645
Let’s eliminate the squared term for Height as it’s not significant:
Call:
lm(formula = Volume ~ I(Girth - m.Girth) + I((Girth - m.Girth)ˆ2) +
I(Height - m.Height), data = trees)
Residuals:
111
Min 1Q Median 3Q Max
-4.293 -1.669 -0.102 1.785 4.349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.6109 0.6431 42.93 < 2e-16
I(Girth - m.Girth) 4.2325 0.1963 21.56 < 2e-16
I((Girth - m.Girth)ˆ2) 0.2686 0.0459 5.85 3.1e-06
I(Height - m.Height) 0.3764 0.0882 4.27 0.00022
112
Cherry Tree Example XIV
[Plot: the four default lm diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 17, 18, 26 and 30 are flagged]
Call:
lm(formula = log(Volume) ~ log(Girth) + log(Height), data = trees)
Residuals:
Min 1Q Median 3Q Max
-0.16856 -0.04849 0.00243 0.06364 0.12922
Coefficients:
113
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.632 0.800 -8.29 5.1e-09
log(Girth) 1.983 0.075 26.43 < 2e-16
log(Height) 1.117 0.204 5.46 7.8e-06
[Plot: diagnostic plots for the log-log model; observations 11, 15, 16, 17 and 18 are flagged]
Anorexia Example
Next we will use the anorexia data which is also in the MASS package.
The data set has three variables:
• Treat
Type of psychotherapy. Factor of three levels Cont, CBT and FT. Cont should be the reference group.
• Prewt
Weight of the subject before the treatment in lbs.
• Postwt
Weight of the subject after the treatment in lbs.
Of interest is now, if the treatments have different effects on the weight of the subjects.
114
Anorexia Example I
This data set contains the effect of different forms of therapy on the body weight of subjects suffering from anorexia.
Anorexia Example II
> summary(anorexia)
Treat Prewt Postwt
CBT :29 Min. :70.0 Min. : 71.3
Cont:26 1st Qu.:79.6 1st Qu.: 79.3
FT :17 Median :82.3 Median : 84.0
Mean :82.4 Mean : 85.2
3rd Qu.:86.0 3rd Qu.: 91.5
Max. :94.9 Max. :103.6
115
Anorexia Example III
[Plot: boxplot of pre-treatment weight (preweight) by TREAT (Cont, CBT, FT)]
Anorexia Example IV
[Plot: boxplot of post-treatment weight (postweight) by TREAT (Cont, CBT, FT)]
Anorexia Example V
This shows how pipes (Chapter 3) can be used for summarizing data frames:
116
> anorexia |>
+ subset(select = Prewt:TREAT) |>
+ with(aggregate(cbind(Prewt, Postwt),
+ data.frame(TREAT),
+ function(x) c(mean=mean(x), sd = sd(x)))) |>
+ cbind(n.group = with(anorexia, tapply(Prewt, TREAT, length)))
TREAT Prewt.mean Prewt.sd Postwt.mean Postwt.sd n.group
Cont Cont 81.558 5.707 81.108 4.744 26
CBT CBT 82.690 4.845 85.697 8.352 29
FT FT 83.229 5.017 90.494 8.475 17
Anorexia Example VI
We fit a linear model with TREAT as the explanatory variable. Note that the treatment contrasts are used by default.
> summary(anfit1)
Call:
lm(formula = Postwt ~ TREAT, data = anorexia)
Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 81.11 1.43 56.75 <2e-16
TREATCBT 4.59 1.97 2.33 0.0227
TREATFT 9.39 2.27 4.13 0.0001
The intercept coefficient gives the average weight (post treatment) for the Cont control (i.e., reference) group; the
TREATCBT coef shows that patients in the CBT group have on average 4.589 lbs more than the reference group; the
TREATFT coef shows that patients in the FT group have on average 9.386 lbs more than the reference group.
117
> anfit1b <- lm(Postwt ~ TREAT - 1, data = anorexia)
> model.matrix(anfit1b)[id,]
TREATCont TREATCBT TREATFT
12 1 0 0
24 1 0 0
36 0 1 0
48 0 1 0
60 0 0 1
72 0 0 1
Anorexia Example IX
The coefficients now represent the average weight post treatment in each category.
> summary(anfit1b)
Call:
lm(formula = Postwt ~ TREAT - 1, data = anorexia)
Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
TREATCont 81.11 1.43 56.8 <2e-16
TREATCBT 85.70 1.35 63.3 <2e-16
TREATFT 90.49 1.77 51.2 <2e-16
Note: 1. The hypothesis tests are not so informative as one is typically interested in whether the differences among
the groups are significant.
Anorexia Example X
Anorexia Example XI
118
> summary(anfit1c)
Call:
lm(formula = Postwt ~ TREAT, data = anorexia, contrasts = list(TREAT = "contr.sum"))
Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.7661 0.8819 97.26 < 2e-16
TREAT1 -4.6584 1.2078 -3.86 0.00025
TREAT2 -0.0696 1.1782 -0.06 0.95309
• (Intercept) (β0 ) - mean of the mean weights in each group (a bit weird . . . if the data were balanced, it
would be the mean weight in the whole dataset).
• TREAT1 (β1 ) - deviation of the average weight for Cont from the intercept.
• TREAT2 (β2 ) - deviation of the average weight for CBT from the intercept.
• We don’t have any coefficient for FT as its deviation is by construction −β1 − β2 .
• Note: The F-statistic and R2 do not change, we only transform the coefficients.
> summary(anfit1d)
Call:
lm(formula = Postwt ~ TREAT, data = anorexia, contrasts = list(TREAT = "contr.helmert"))
Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
119
(Intercept) 85.766 0.882 97.26 < 2e-16
TREAT1 2.294 0.984 2.33 0.02267
TREAT2 2.364 0.674 3.51 0.00081
• (Intercept) (β0 ) - mean of the mean weight in each group (still weird . . . ).
• TREAT1 (β1 ) - the average value of the means in Cont and CBT is 2.29 lbs higher than the mean of Cont.
• TREAT2 (β2 ) - the average value of the means in Cont, CBT and FT is 2.364 lbs higher than the average value
of the means in Cont and CBT.
• Note: Not the most intuitive . . .
Let’s have a look now at a scatterplot of Prewt and Postwt and color the points by TREAT
120
Anorexia Example XV
[Plot: scatterplots of Postwt vs. Prewt - all groups together and separately for Cont, CBT and FT]
The relationship seems different for the different TREAT groups.
The following will fit a regression with the same slope but different intercepts for the different TREAT groups.
We can plot the regression lines for each class by first calculating the main effects (i.e., separate intercepts) from the
coefficients:
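The fitting call itself is not shown on the slide; it presumably has the following form (the object name anfit2 is an assumption):
> anfit2 <- lm(Postwt ~ TREAT + Prewt, data = anorexia)
> coef(anfit2)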
121
Anorexia Example XVII
[Plot: Postwt vs. Prewt with the fitted values (marked x) and parallel regression lines for Cont, CBT and FT]
The following will fit a regression with different slopes and different intercepts for the different TREAT groups.
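The call, reconstructed from the output below:
> anfit3 <- lm(Postwt ~ TREAT * Prewt, data = anorexia)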
Call:
lm(formula = Postwt ~ TREAT * Prewt, data = anorexia)
Residuals:
Min 1Q Median 3Q Max
-12.812 -3.850 -0.915 4.001 15.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.051 18.809 4.89 6.7e-06
TREATCBT -76.474 28.347 -2.70 0.0089
TREATFT -77.232 33.133 -2.33 0.0228
Prewt -0.134 0.230 -0.58 0.5617
TREATCBT:Prewt 0.982 0.344 2.85 0.0058
TREATFT:Prewt 1.043 0.400 2.61 0.0112
122
> model.matrix(anfit3)[id,]
(Intercept) TREATCBT TREATFT Prewt TREATCBT:Prewt TREATFT:Prewt
12 1 0 0 88.7 0.0 0.0
24 1 0 0 77.5 0.0 0.0
36 1 1 0 80.5 80.5 0.0
48 1 1 0 76.5 76.5 0.0
60 1 0 1 86.7 0.0 86.7
72 1 0 1 87.3 0.0 87.3
Anorexia Example XX
> summary(anfit3)
Call:
lm(formula = Postwt ~ TREAT * Prewt, data = anorexia)
Residuals:
Min 1Q Median 3Q Max
-12.812 -3.850 -0.915 4.001 15.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.051 18.809 4.89 6.7e-06
TREATCBT -76.474 28.347 -2.70 0.0089
TREATFT -77.232 33.133 -2.33 0.0228
Prewt -0.134 0.230 -0.58 0.5617
TREATCBT:Prewt 0.982 0.344 2.85 0.0058
TREATFT:Prewt 1.043 0.400 2.61 0.0112
Note:
From the coefficients we can compute the intercepts and slopes of the different regression lines:
> coef(anfit3)
(Intercept) TREATCBT TREATFT Prewt TREATCBT:Prewt
92.0515 -76.4742 -77.2317 -0.1342 0.9822
TREATFT:Prewt
1.0434
123
> abline(coef(anfit3)[1],coef(anfit3)[4], col=cols[1])
> abline(coef(anfit3)[1]+coef(anfit3)[2],coef(anfit3)[4]+
+ coef(anfit3)[5], col=cols[2])
> abline(coef(anfit3)[1]+coef(anfit3)[3],coef(anfit3)[4]+
+ coef(anfit3)[6], col=cols[3])
> with(anorexia, points(Prewt,fitted(anfit3),
+ pch = "x",
+ col = cols[as.numeric(TREAT)]))
[Plot: Postwt vs. Prewt with the fitted values (marked x) and separate regression lines (different intercepts and slopes) for Cont, CBT and FT]
To make the intercept more interpretable we could subtract the minimum value from Prewt:
> min(anorexia$Prewt)
[1] 70
> anorexia$Prewt2 <-
+ anorexia$Prewt - min(anorexia$Prewt)
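The corresponding fit, reconstructed from the output below:
> anfit4 <- lm(Postwt ~ TREAT * Prewt2, data = anorexia)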
> summary(anfit4)
Call:
lm(formula = Postwt ~ TREAT * Prewt2, data = anorexia)
124
Residuals:
Min 1Q Median 3Q Max
-12.812 -3.850 -0.915 4.001 15.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.659 2.954 27.98 <2e-16
TREATCBT -7.723 4.558 -1.69 0.0949
TREATFT -4.193 5.477 -0.77 0.4467
Prewt2 -0.134 0.230 -0.58 0.5617
TREATCBT:Prewt2 0.982 0.344 2.85 0.0058
TREATFT:Prewt2 1.043 0.400 2.61 0.0112
[Plot: the four default diagnostic plots for the interaction model]
How would the weight of a patient with a weight of 90lbs before the study change post-study depending on the
treatment?
125
> new.data <- data.frame(Prewt = c(90, 90, 90),
+ Prewt2 = c(20, 20, 20),
+ TREAT = factor(c("Cont","CBT","FT"),
+ levels = c("Cont","CBT","FT")))
> new.data
Prewt Prewt2 TREAT
1 90 20 Cont
2 90 20 CBT
3 90 20 FT
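A sketch of the prediction step (anfit4 is based on Prewt2, so new.data contains both Prewt and Prewt2):
> predict(anfit4, newdata = new.data)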
As a short final example we consider the Scottish Hills data (data set hills in the MASS package), which gives the
record times in 1984 for 35 Scottish hill races. The variables are:
[Plot: scatterplot matrix of dist, climb and time]
126
Scottish Hills Example II
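The object fit.lm used below was presumably created as (reconstructed from the Call in the output):
> fit.lm <- lm(time ~ dist + climb, data = hills)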
> summary(fit.lm)
Call:
lm(formula = time ~ dist + climb, data = hills)
Residuals:
Min 1Q Median 3Q Max
-16.22 -7.13 -1.19 2.37 65.12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.99204 4.30273 -2.09 0.045
dist 6.21796 0.60115 10.34 9.9e-12
climb 0.01105 0.00205 5.39 6.4e-06
[Plot: diagnostic plots for fit.lm; Bens of Jura, Ben Nevis and Lairig Ghru are among the flagged observations]
Knock Hill has a large residual and has Cook distance close to 0.5.
127
> # removing Knock Hill
> fit.lm.wKH <- update(fit.lm, subset = -18)
> coef(fit.lm.wKH)
(Intercept) dist climb
-13.53035 6.36456 0.01185
The Bens of Jura race was identified as an influential point given that it has a Cook’s distance above one.
[Plot: diagnostic plots for the model without Knock Hill; Lairig Ghru and Ben Nevis are among the flagged observations]
We can also weight observations in the regression model (by default all observations contribute equally to the
estimation of the coefficients).
128
> # weights 1/dist^2 - long distance races get less weight
> fit.lm2 <- lm(time ~ dist + climb, weight = 1 / dist^2, data = hills)
> summary(fit.lm2)
Call:
lm(formula = time ~ dist + climb, data = hills, weights = 1/distˆ2)
Weighted Residuals:
Min 1Q Median 3Q Max
-3.728 -1.521 -0.513 0.324 18.620
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.62715 6.26766 0.58 0.5668
dist 5.93960 1.71496 3.46 0.0015
climb 0.00384 0.00482 0.80 0.4321
> str(influence.measures(fit.lm2))
List of 3
$ infmat: num [1:35, 1:7] -0.22373 0.00126 0.01023 0.01639 0.00281 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:35] "Greenmantle" "Carnethy" "Craig Dunain" "Ben Rha" ...
.. ..$ : chr [1:7] "dfb.1_" "dfb.dist" "dfb.clmb" "dffit" ...
$ is.inf: logi [1:35, 1:7] FALSE FALSE FALSE FALSE FALSE FALSE ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:35] "Greenmantle" "Carnethy" "Craig Dunain" "Ben Rha" ...
.. ..$ : chr [1:7] "dfb.1_" "dfb.dist" "dfb.clmb" "dffit" ...
$ call : language lm(formula = time ~ dist + climb, data = hills, weights = 1/distˆ2)
- attr(*, "class")= chr "infl"
For the weighted regression Knock Hill is influential (it has a low value for distance, so its influence gets higher due to
the weight). Also Cow Hill has a small distance.
> summary(influence.measures(fit.lm2))
Potentially influential observations of
lm(formula = time ~ dist + climb, data = hills, weights = 1/distˆ2) :
129
[Plot: diagnostic plots for the weighted regression fit.lm2; Knock Hill, Bens of Jura, Two Breweries, Black Hill and Creag Dubh are among the flagged observations]
Recap
We have seen so far multiple operations on matrices:
• the dimensions of a matrix can be obtained using the functions dim, ncol and nrow.
130
• subsetting is done using the function [ where, in a linear algebra context, the argument drop = FALSE is often
important (if we need vectors to be row or column vectors)
• applying functions rowwise or columnwise: apply(x, 1, function(x) ...), apply(x, 2, function(x) ...)
• specialized functions for rowwise and columnwise summaries: colMeans, rowMeans, colSums, rowSums (faster
than apply).
• standardizing a matrix using scale.
131
[3,] 0.5736 0.09268 0.7765
> diag(X)
[1] 0.7570 0.7583 0.7765
> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
> diag(1:3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 2 0
[3,] 0 0 3
> X
[,1] [,2] [,3]
[1,] 0.7570 0.18870 0.6224
[2,] 0.7758 0.75827 0.4159
[3,] 0.5736 0.09268 0.7765
> t(X)
[,1] [,2] [,3]
[1,] 0.7570 0.7758 0.57359
[2,] 0.1887 0.7583 0.09268
[3,] 0.6224 0.4159 0.77651
> det(X)
[1] 0.122
R has no built-in trace function in base R, but one can easily be defined:
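For example:
> tr <- function(X) sum(diag(X))   # trace = sum of the diagonal elements
> tr(diag(1:3))
[1] 6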
Triangular matrices I
• Functions lower.tri() and upper.tri() can be used to obtain the lower and upper parts of matrices.
• The output of these functions is a matrix of logical values where TRUE marks the relevant triangular
elements.
132
> lower.tri(X)
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE TRUE FALSE
Triangular matrices II
We can use these functions, e.g., to replace all upper triangular elements by zero:
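> X[upper.tri(X)] <- 0   # set all elements above the diagonal to zero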
Matrix arithmetic
• Multiplication of a matrix X by a scalar a is the same as the multiplication of a vector with a scalar.
• Elementwise addition, multiplication etc can be done with +, * etc. (the dimensions must match)
> Y <- 2 * M
> Y + M
[,1] [,2] [,3]
[1,] 3 15 27
[2,] 6 18 30
[3,] 9 21 33
[4,] 12 24 36
Matrix multiplication
• For standard matrix multiplication the function is %*%. Let x and y be vectors and X and Y matrices.
• If vectors are used in the multiplication, R tries to figure out whether they should be row or column vectors.
• If y %*% x is computed and both vectors have the same length, the inner product is returned as a (1 × 1) matrix.
133
Matrix multiplication II
> X
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> Y
[,1] [,2] [,3]
[1,] 10 14 18
[2,] 11 15 19
[3,] 12 16 20
[4,] 13 17 21
> x %*% y
[,1]
[1,] 38
> x %*% X
[,1] [,2] [,3]
[1,] 14 32 50
> X %*% x
[,1]
[1,] 30
[2,] 36
[3,] 42
> Y %*% X
[,1] [,2] [,3]
[1,] 92 218 344
[2,] 98 233 368
[3,] 104 248 392
[4,] 110 263 416
Matrix inversion
To obtain the inverse of an invertible square matrix, R has the function solve.
Computing the inverse is, however, computationally expensive, and there is hardly ever a good reason to explicitly
invert a matrix in statistical computations.
More on solve I
The function solve actually returns the inverse of a matrix only as a byproduct.
In general the purpose of the function is to solve systems of linear equations like
$$Ax = b \iff x = A^{-1}b$$
More on solve II
Assume $A$ is an $(n \times n)$ matrix. $A^{-1}$ is the solution to the matrix equation $AA^{-1} = I_n$. This can be seen as $n$
separate systems of linear equations in $n$ unknowns, whose solutions are the columns of the inverse.
It would be inefficient to first solve $n$ systems of linear equations in order to obtain the inverse, only to then solve
one system, namely the original one.
Moreover, computing the inverse requires many more calculations, which gives rounding errors more opportunities to
distort the results.
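A small sketch of the two routes (A and b are made-up inputs); solving directly is both cheaper and numerically preferable:

set.seed(1)
A <- crossprod(matrix(rnorm(9), 3, 3))   # an invertible 3 x 3 matrix
b <- c(1, 2, 3)
solve(A, b)        # solve Ax = b directly
solve(A) %*% b     # same result via the explicit inverse (not recommended)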
For a symmetric matrix the eigendecomposition (spectral decomposition) is $A = UDU^\top$, where $D$ is a diagonal matrix of eigenvalues and the columns of $U$ are the corresponding eigenvectors; in R it is computed with eigen().
> X <- matrix(rnorm(300), ncol = 3)
> covX <- cov(X)
> eigen(covX, symmetric = TRUE)
eigen() decomposition
$values
[1] 1.3165 1.1031 0.9205
$vectors
[,1] [,2] [,3]
[1,] 0.6517 -0.18573 0.73540
[2,] 0.1693 0.98072 0.09768
[3,] -0.7394 0.06084 0.67055
• qr: QR-decomposition
• chol: Cholesky decomposition of a symmetric positive definite matrix
• svd: singular value decomposition
• outer : performs an operation on all possible pairs of elements of two vectors.
• kronecker: computes the Kronecker product.
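Two quick illustrations of the last two functions:

outer(1:3, 1:4)                       # 3 x 4 matrix of all products i * j
outer(1:3, 1:4, FUN = "+")            # the operation can be changed via FUN
kronecker(diag(2), matrix(1, 2, 2))   # block-diagonal 4 x 4 matrix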
When working with matrices that have special properties, such as sparse matrices, it is worth checking the
Matrix package, which provides classes for the different types of matrices and can take advantage of that knowledge,
for example when computing decompositions or products.
Cholesky decomposition in R I
• If $A$ is positive semidefinite, it possesses a square root $B$ such that $B^2 = A$.
• The Cholesky decomposition is similar, but the idea is to find an upper triangular matrix $U$ such that $U^\top U = A$.
Cholesky decomposition in R II
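The code of this slide is not fully reproduced here; judging from the output below, the example was the Cholesky factor of the 3 × 3 Hilbert matrix, roughly along these lines:

H3 <- 1 / (outer(1:3, 1:3, "+") - 1)   # 3 x 3 Hilbert matrix: H[i, j] = 1/(i + j - 1)
chol.H3 <- chol(H3)                    # upper triangular Cholesky factor U
chol.H3
crossprod(chol.H3)                     # t(U) %*% U recovers H3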
Cholesky decomposition in R III
• The Cholesky decomposition can be employed to find the inverse of $A$ more efficiently:
$$A = U^\top U \;\Rightarrow\; A^{-1} = U^{-1}(U^{-1})^\top,$$
where computing $U^{-1}$ is easier given the triangular structure.
• A−1 can be obtained by chol2inv():
> chol2inv(chol.H3)
[,1] [,2] [,3]
[1,] 9 -36 30
[2,] -36 192 -180
[3,] 30 -180 180
Cholesky decomposition in R IV
• The Cholesky decomposition can be employed to solve linear systems of the form:
$$Ax = b \;\Rightarrow\; U^\top U x = b \;\Rightarrow\; Ux = (U^{-1})^\top b.$$
• The first step, solving $U^\top z = b$ for $z = Ux$, is a lower triangular system, so forward substitution can be used; the function forwardsolve() does this.
• The second step, solving $Ux = z$, is an upper triangular system, so back substitution can be used via the function backsolve().
Cholesky decomposition in R V
For the problem $H_3 x = b$ where $b = (1, 2, 3)^\top$ we have:
> solve(H3, b)
[1] 27 -192 210
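The same solution can be obtained via the Cholesky factor using forward and back substitution (a sketch reusing chol.H3 from above):

b <- c(1, 2, 3)
z <- forwardsolve(t(chol.H3), b)   # solve t(U) z = b (lower triangular)
backsolve(chol.H3, z)              # solve U x = z  (upper triangular)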
QR decomposition in R I
• Another way of decomposing a matrix A is through the QR decomposition:
$$A = QR,$$
where $Q$ is an orthogonal matrix ($Q^\top Q = I$) and $R$ is upper triangular.
QR decomposition in R II
For more details on the output of qr() see the help page ?qr
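The call producing the output shown below is not included on the slide; presumably the Hilbert matrix H3 from above was decomposed along these lines:

H3qr <- qr(H3)
H3qr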
$rank
[1] 3
$qraux
[1] 1.857143 1.684241 0.003901
$pivot
[1] 1 2 3
attr(,"class")
[1] "qr"
QR decomposition in R III
> qr.Q(H3qr)
[,1] [,2] [,3]
[1,] -0.8571 0.5016 0.1170
[2,] -0.4286 -0.5685 -0.7022
[3,] -0.2857 -0.6521 0.7022
QR decomposition in R IV
• The QR decomposition can be used to obtain more accurate solutions to linear systems. If we want to solve
(here $A$ is an $(n \times n)$ matrix):
$$Ax = b \;\Rightarrow\; QRx = b \;\Rightarrow\; Rx = Q^\top b$$
• Here $Q^\top b$ is easy to calculate. The system can then be solved using back substitution, as $R$ is an
upper triangular matrix.
• Function qr.solve(A, b) can be used to solve the above system.
• If the system is over-determined, qr.solve(A, b) returns the least-squares solution, i.e., the $x$ which minimizes the
distance between $b$ and $Ax$. (Note: this is useful in a linear regression context, where $A$ is the design matrix,
$b$ the response and $x$ the vector of coefficients; see the sketch below.)
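A minimal sketch of this least-squares use (the data are simulated purely for illustration):

set.seed(1)
A <- cbind(1, rnorm(10))          # design matrix: intercept and one covariate
b <- 2 + 3 * A[, 2] + rnorm(10)   # response
qr.solve(A, b)                    # least-squares coefficients
coef(lm(b ~ A[, 2]))              # the same coefficients via lm()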
Computational approaches to hypothesis testing
• One-sample location test: $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$ under the assumption $f(x - \mu) = f(-(x - \mu))$ (symmetric
density).
• Two-sample location test: $H_0: F_X = F_Y$ vs. $H_1: F_X \neq F_Y$, where the difference is at most in the
locations of the two groups $X$ and $Y$.
• Test of independence: $H_0: F_{X,Y} = F_X F_Y$ vs. $H_1: F_{X,Y} \neq F_X F_Y$.
Cook book
The goal of classical hypothesis testing is to answer the question: Given a sample and an apparent effect, what is the
probability of seeing such an effect by chance?
Example: t-test in R I
• Student’s t-test is used to test in normal populations a hypothesis about the location or to compare the location
of two normal populations.
• In the latter case one must furthermore decide if the two populations have the same variance or not and if the
test is based on paired or independent observations.
Example: t-test in R II
• All these cases are considered in the function t.test.
– For the one sample case only a numeric vector has to be submitted and by default the hypothetical
location is the origin.
– For the two sample case one can submit either two numeric vectors or a formula where the independent
variable is a factor with two levels.
– In the two sample case the default setting assumes that the samples are independent and have different
variances.
– power.t.test: computes the power of the one- or two-sample t-test, or determines parameters needed to
obtain a target power.
– pairwise.t.test: calculates pairwise comparisons between group levels with corrections for multiple
testing.
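The call producing the following output is not shown on the slide; given the printed data name and null value it was presumably along these lines (crabs is from the MASS package):

library(MASS)
one.samp.t <- t.test(crabs$RW, mu = 10)   # one-sample t-test of H0: mu = 10
one.samp.t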
data: crabs$RW
t = 15, df = 199, p-value <2e-16
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
12.38 13.10
sample estimates:
mean of x
12.74
> str(one.samp.t)
List of 10
$ statistic : Named num 15
..- attr(*, "names")= chr "t"
$ parameter : Named num 199
..- attr(*, "names")= chr "df"
$ p.value : num 1.12e-34
$ conf.int : num [1:2] 12.4 13.1
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 12.7
..- attr(*, "names")= chr "mean of x"
$ null.value : Named num 10
..- attr(*, "names")= chr "mean"
$ stderr : num 0.182
$ alternative: chr "two.sided"
$ method : chr "One Sample t-test"
$ data.name : chr "crabs$RW"
- attr(*, "class")= chr "htest"
> one.samp.t$statistic
t
15.05
> one.samp.t$p.value
[1] 1.116e-34
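Similarly, the two-sample output below was presumably produced with the formula interface (RW by sex in the crabs data):

t.test(RW ~ sex, data = crabs)   # Welch two-sample t-test by default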
data: RW by sex
t = 4.3, df = 188, p-value = 3e-05
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
0.8086 2.1854
sample estimates:
mean in group F mean in group M
13.49 11.99
Types of errors
• In classical hypothesis testing, an effect is considered statistically significant if the p-value is below some threshold,
commonly 5% (known as significance level).
• Assume you have a testing problem with null hypothesis H0 vs. alternative H1 which is done at the significance level α.
• Two errors can basically occur during testing:
– Type 1 error: the effect is actually due to chance, but we will wrongly consider it significant (H0 is rejected but
true).
– Type 2 error: the effect is real but the test fails (H0 is not rejected but false).
– If there is no real effect, the null hypothesis is true, so we can compute the distribution of the test statistic by
simulating under the null hypothesis. Call this distribution $\mathrm{CDF}_T$.
– Each time we run an experiment, we get a test statistic $t$ which is drawn from $\mathrm{CDF}_T$. Then we compute a p-value,
which is the probability that a random value from $\mathrm{CDF}_T$ exceeds $t$, i.e., $1 - \mathrm{CDF}_T(t)$.
– The p-value is less than 5% if $\mathrm{CDF}_T(t)$ is greater than 95%, that is, if $t$ exceeds the 95th percentile. And how
often does a value drawn from $\mathrm{CDF}_T$ exceed the 95th percentile? 5% of the time.
– → If the null hypothesis is true, the p-value has a uniform distribution over the interval [0, 1].
> set.seed(1)
> n <- 50
> m <- 5000
> Pvalue <- replicate(m, t.test(rnorm(n))$p.value)
> summary(Pvalue)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.243 0.496 0.496 0.745 1.000
[Figure: histogram (density scale) of the simulated p-values Pvalue; the distribution is approximately uniform on [0, 1].]
Power of a test
• The power of a test is defined as 1 - false negative rate.
• The false negative rate is harder to compute because it depends on the actual effect size, and normally we don’t know
that.
• One option is to compute a rate conditioned on a hypothetical effect size.
Changes under the H0
Given the testing problem and the assumptions made, test statistics can be computed and compared to some theoretically
derived critical value, often based on further assumptions.
The null hypothesis, however, often implies that, if it were true, a “relabeling” of the data would be possible without
changing “anything”.
• One-sample location test: We can change the signs of x (after centering with respect to µ0) and nothing should change.
• Two-sample location test: We can switch observations between the two groups without changing anything.
• Test of independence: We can match the “X” part with the “Y” part from different observations.
Assuming either that the data are normal or that the sample size is large with finite second moments, we can use the
one-sample t-test.
Randomized one sample t-test
If, however, the sample size is small and normality cannot be assumed, it is better to use the randomized one-sample t-test.
In this set-up, because of the symmetry under the null, the signs of $x - \mu_0$ can be “relabeled”. Therefore the randomization
(sign-change) version of this test has the following steps (see the sketch after the list):
1. compute $y = x - \mu_0$.
2. randomly change the signs of $y$.
3. compute $t_i = \bar{y}/(s_y/\sqrt{n})$.
4. compute how often $t_i$ is more extreme than $t$, where one has to remember that we test two-sided!
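A minimal sketch of these steps as an R function (the function name and the number of randomizations B are arbitrary choices):

rand.t.test <- function(x, mu0 = 0, B = 1000) {
  y <- x - mu0
  t.obs <- mean(y) / (sd(y) / sqrt(length(y)))         # observed t statistic
  t.star <- replicate(B, {
    s <- sample(c(-1, 1), length(y), replace = TRUE)   # random sign change
    mean(s * y) / (sd(s * y) / sqrt(length(y)))
  })
  mean(abs(t.star) >= abs(t.obs))                      # two-sided p-value
}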
> set.seed(4321)
> n <- 30
> x1 <- rnorm(n,2,1)
> x2 <- x1 + 0.5
One sample t-test comparison II
1. The data is $t_3 + 2$.
2. The data is $t_3 + 2.5$.
> set.seed(4321)
> n <- 30
> y1 <- rt(n, df=3) + 2
> y2 <- y1 + 0.5
Tests for this problem
There are many tests for this problem and we assume the two-sample t-test is well known.
In the following we will consider the (non-parametric) alternatives. Define
$$S(z_i) = \begin{cases} 0 & \text{for } z_i < 0 \\ 1 & \text{for } z_i \ge 0 \end{cases}$$
as the sign of $z_i$ and
$$R(z_i) = \sum_{j=1}^{n} S(z_i - z_j)$$
as the rank of $z_i$.
Two sample sign test in R
> t.test(x, y)
data: x and y
t = -35, df = 957, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.096 -1.871
sample estimates:
mean of x mean of y
-0.02839 1.95483
Example I: Wilcoxon rank sum test
> wilcox.test(x, y)
data: x and y
W = 44624, p-value <2e-16
alternative hypothesis: true location shift is not equal to 0
> ST2S(x, y)
$K
[1] 294
$Z
[1] -22.57
$p.val
[1] 0
[Figure: boxplots of x2 and y2 with the group means (mean of x2, mean of y2) marked.]
Example II: Two Sample t-test
data: x2 and y2
t = 8.4, df = 1351, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2132 0.3432
sample estimates:
mean of x mean of y
0.9859 0.7077
data: x2 and y2
W = 261174, p-value = 0.2
alternative hypothesis: true location shift is not equal to 0
$Z
[1] 1.753
$p.val
[1] 0.07965
Efficiency comparisons
Assuming the data follow normal distributions that differ only in location, the asymptotic relative efficiencies (ARE) for
the test comparisons are:
1. t vs. K: 0.64
2. t vs. $W_N$: 0.95
However, when the data have heavy tails, $W_N$ and K are more efficient than t.
³ As the sample size tends to infinity, it is assumed that the alternative hypothesis approaches the null hypothesis to keep the …
Simulation study for efficiency comparisons II
> P0
TT ST WT
0.050 0.067 0.051
> P1
TT ST WT
0.291 0.243 0.270
> power.t.test(30, 2.86/sqrt(60))
n = 30
delta = 0.3692
sd = 1
sig.level = 0.05
power = 0.2899
alternative = two.sided
> P2
TT ST WT
0.581 0.478 0.550
> power.t.test(30, 4.50/sqrt(60))
n = 30
delta = 0.5809
sd = 1
sig.level = 0.05
power = 0.5997
alternative = two.sided
> P3
TT ST WT
0.899 0.782 0.879
> power.t.test(30, 6.60/sqrt(60))
n = 30
delta = 0.8521
sd = 1
sig.level = 0.05
power = 0.9006
alternative = two.sided
Simulation study for efficiency comparisons V
> P1[2]/P1[1]
ST
0.8351
> P2[2]/P2[1]
ST
0.8227
> P3[2]/P3[1]
ST
0.8699
> P1[3]/P1[1]
WT
0.9278
> P2[3]/P2[1]
WT
0.9466
> P3[3]/P3[1]
WT
0.9778
> P0b
TT ST WT
0.043 0.062 0.042
> P1b
TT ST WT
0.317 0.245 0.291
> P2b
TT ST WT
0.630 0.489 0.615
> P3b
TT ST WT
0.917 0.808 0.904
Numerical optimization and root finding in R
Numerical optimization
• In many areas of statistics and mathematics we have to solve problems like: given a function f(), which value of x makes
f(x) as small or as large as possible?
• E.g., in statistical modeling we may want to find a set of parameters for a model which minimizes the expected
prediction error.
• In some cases we might also have constraints on x, e.g., the parameters shall be non-negative.
• Derivatives and linear algebra often lead to solutions for these problems, but by no means always. This is where
numerical optimization comes in.
Root finding
• Root finding and unconstrained optimization are closely related: solving $f(x) = 0$ can be accomplished by minimizing
$\|f(x)\|^2$, and unconstrained optima of $f$ must be critical points, i.e., solve $\nabla f(x) = 0$.
• For linear least-squares problems this can be solved “exactly” using techniques from linear algebra.
• Other problems can typically only be solved as limits of iterations $x_k = g(x_{k-1})$.
• If $f$ is smooth with Jacobian $J_f(x) = [\partial f_i / \partial x_j(x)]$, the idea is based on the Taylor approximation
$f(x_k) \approx f(x_{k-1}) + J_f(x_{k-1})(x_k - x_{k-1})$.
• If started close enough to the root, the following (Newton) iteration will converge to a root:
$$x_k = x_{k-1} - J_f^{-1}(x_{k-1}) f(x_{k-1}) = g(x_{k-1}), \qquad x_0 = \text{initial guess}$$
Function curve()
The function curve draws a curve corresponding to a function over the interval [from, to].
> f <- function(x) xˆ3 + 15 * x - 4
> curve(f, -5, 5)
> abline(h = 0)
> abline(v = xstar, lty = 2)
[Figure: curve of f(x) = x^3 + 15x − 4 over [−5, 5] with the horizontal line y = 0 and a dashed vertical line at the root xstar.]
The function has 1 real root. The analytical solution is $2 - \sqrt{3}$.
[Figure: curve of the second example function over the same interval, crossing zero three times.]
The function has 3 real roots. The analytical solution is $-2 - \sqrt{3}$, $-2 + \sqrt{3}$ and $4$.
One dimensional example II
• Quasi-Newton methods, which replace $J_f(x_{k-1})$ with another matrix $B_k$ that is less costly to compute or to invert, can be
employed for root finding. The most famous such method is Broyden's method.
– It considers approximations $B_k$ which exactly satisfy the secant equation $f(x_k) = f(x_{k-1}) + B_k(x_k - x_{k-1})$.
– The problem of choosing $B_k$ ends up being a convex quadratic optimization problem with linear constraints.
Tools in R
• These methods are only available in R extension packages.
• Package nleqslv has function nleqslv() which provides the Newton and Broyden methods.
• Package BB has function BBsolve() for Barzilai-Borwein solvers.
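The system solved in the output on the following slides is not reproduced here; as a self-contained illustration, a small two-equation test problem can be solved with both methods like this:

library(nleqslv)
# f1(x) = x1^2 + x2^2 - 2,  f2(x) = exp(x1 - 1) + x2^3 - 2
fsys <- function(x) c(x[1]^2 + x[2]^2 - 2, exp(x[1] - 1) + x[2]^3 - 2)
nleqslv(c(2, 0.5), fsys, method = "Newton")    # Newton's method
nleqslv(c(2, 0.5), fsys, method = "Broyden")   # Broyden's method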
$fvec
[1] 1.500e-09 2.056e-09
$termcd
[1] 1
$message
[1] "Function criterion near zero"
$scalex
[1] 1 1
$nfcnt
[1] 12
$njcnt
[1] 1
$iter
[1] 10
$fvec
[1] 6.842e-10 1.764e-09
$termcd
[1] 1
$message
[1] "Function criterion near zero"
$scalex
[1] 1 1
$nfcnt
[1] 6
$njcnt
[1] 5
$iter
[1] 5
Both methods deliver the correct solution; Newton's method needs fewer iterations.
$fvec
[1] 6.839e-10 1.762e-09
$termcd
[1] 1
$message
[1] "Function criterion near zero"
$scalex
[1] 1 1
$nfcnt
[1] 6
$njcnt
[1] 5
$iter
[1] 5
Optimization
In this section we will cover some algorithms and types of optimization problems and show how they can be implemented in R.
• Newton-Raphson
• Linear programming
• Quadratic programming
Newton-Raphson I
• If the function to be minimized has two continuous derivatives and we know how to evaluate them, we can employ the
Newton-Raphson algorithm.
• If we have a guess $x_0$ at a minimizer, we use a local quadratic approximation for $f$ (equivalently, a linear approximation
for $\nabla f$):
$$x_k = x_{k-1} - H_f^{-1}(x_{k-1}) \nabla f(x_{k-1}),$$
where $H_f(x) = [\partial^2 f / \partial x_i \partial x_j(x)]$ is the Hessian matrix of $f$ at $x$.
Newton-Raphson II
• It can be shown that the NR algorithm converges to a local minimum if $x_0$ is close enough to the solution.
• In practice it can be quite tricky:
– If the second derivative at $x_{k-1}$ is 0, then there is no solution to the Taylor series approximation.
– If $x_{k-1}$ is too far from the solution, the Taylor approximation can be so inaccurate that $f(x_k)$ is larger than
$f(x_{k-1})$. In this case one can replace $x_k$ by $(x_k + x_{k-1})/2$.
Newton-Raphson example in R
$$f(x) = e^{-x} + x^4$$
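The code for this example is not reproduced here; a minimal sketch of the iteration for this f (the starting value and tolerance are arbitrary choices):

# minimize f(x) = exp(-x) + x^4 via Newton-Raphson applied to f'(x) = 0
f1 <- function(x) -exp(-x) + 4 * x^3    # first derivative
f2 <- function(x)  exp(-x) + 12 * x^2   # second derivative
x <- 1                                  # initial guess
for (i in 1:25) {
  x.new <- x - f1(x) / f2(x)            # Newton-Raphson update
  if (abs(x.new - x) < 1e-8) break
  x <- x.new
}
x                                       # approximate minimizer (about 0.528)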
Built-in functions
• In R there are several general purpose optimizers.
• For one-dimensional optimization optimize() can be used.
• Multidimensional optimizers:
– optim(), which implements quasi-Newton variants of Newton-Raphson (BFGS), Nelder-Mead's simplex method and others.
– nlminb()
– nlm()
• If linear inequalities should be used on the parameters constrOptim() can be used.
Example: optimize()
$$f(x) = |x - 3.5| + |x - 2| + |x - 1|$$
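The call itself is not shown on the slide; a minimal version (the search interval is an assumption) would be:

f <- function(x) abs(x - 3.5) + abs(x - 2) + abs(x - 1)
optimize(f, interval = c(0, 5))   # minimum at x = 2 with objective 2.5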
$objective
[1] 2.5
Example: optim()
$$f(a, b) = (a - 1) + 3.2/b + 3\log(\Gamma(a)) + 3a\log(b)$$
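The optim() call is not reproduced on the slide; a sketch (the starting values and the guard against invalid parameters are assumptions) could look like this:

f <- function(theta) {
  a <- theta[1]; b <- theta[2]
  if (a <= 0 || b <= 0) return(Inf)   # keep the search in the valid region
  (a - 1) + 3.2 / b + 3 * lgamma(a) + 3 * a * log(b)
}
optim(c(1, 1), f)   # Nelder-Mead by default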
$value
[1] 3.099
$counts
function gradient
47 NA
$convergence
[1] 0
$message
NULL
Linear programming
When the function to optimize is linear and when the constraints we impose on the values x are linear, the problem is called
linear programming.
$$\min_{x_1,\dots,x_k} \; c_1 x_1 + \dots + c_k x_k$$
subject to:
$$a_{11} x_1 + \dots + a_{1k} x_k \ge b_1$$
$$\vdots$$
$$a_{m1} x_1 + \dots + a_{mk} x_k \ge b_m$$
and $x_1 \ge 0, \dots, x_k \ge 0$.
Linear programming in R
• Function lp() from the lpSolve package can be used to solve linear programming problems.
– argument objective.in - the vector of coefficients of the objective function.
– argument const.mat - a matrix containing the coefficients of the x variables in the left hand side of the constraints;
each row corresponds to a constraint.
– argument const.dir - a character vector containing the direction of the inequality constraints (>=, ==, <=).
– argument const.rhs - a vector containing the constants on the right-hand side of the constraints.
• It is based on the revised simplex method.
Linear programming pollution example I
• A company has developed two procedures for reducing sulfur dioxide and carbon dioxide emissions from its factory.
• The first procedure reduces equal amounts of each gas at a per unit cost of $5.
• The second procedure reduces the same amount of sulfur dioxide as the first method, but reduces twice as much carbon
dioxide gas; the per unit cost of this method is $8.
• The company is required to reduce sulfur dioxide emissions by 2 million units and carbon dioxide emissions by 3 million
units.
• What combination of the two emission procedures will meet this requirement at minimum cost?
• Let $x_1$ and $x_2$ denote the number of units of the first and second procedure used. Since both methods reduce sulfur
dioxide emissions at the same rate, the number of units of sulfur dioxide reduced will be $x_1 + x_2$.
• Noting that there is a requirement to reduce the sulfur dioxide amount by 2 million units, we have the constraint
$x_1 + x_2 \ge 2$.
• The carbon dioxide reduction requirement is 3 million units, and the second method reduces carbon dioxide twice as fast
as the first method, so we have the second constraint x1 + 2x2 ≥ 3.
• Finally, we note that x1 and x2 must be nonnegative, since we cannot use negative amounts of either procedure.
Note: Setting direction = "max" will allow the specification of maximization problems.
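A sketch of the corresponding lp() call for the pollution example (the objective coefficients are the unit costs of $5 and $8):

library(lpSolve)
pollution.lp <- lp(direction = "min",
                   objective.in = c(5, 8),                 # unit costs
                   const.mat = rbind(c(1, 1), c(1, 2)),    # SO2 and CO2 constraints
                   const.dir = c(">=", ">="),
                   const.rhs = c(2, 3))
pollution.lp$solution   # optimal number of units of each procedure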
Multiple optima
It sometimes happens that there are multiple solutions for a linear programming problem. The following problem has solutions
at $(1, 1)$ and $(3, 0)$:
$$\min_{x_1, x_2} \; 4x_1 + 8x_2$$
subject to:
$$x_1 + x_2 \ge 2, \quad x_1 + 2x_2 \ge 3, \quad x_1 \ge 0, \; x_2 \ge 0$$
The lp() function does not alert the user to the existence of multiple minima.
Infeasibility
In such a case the constraints cannot be simultaneously satisfied and the problem has no feasible solution.
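A small hypothetical illustration (the slide's own example is not reproduced here):

library(lpSolve)
# x1 + x2 >= 2 and x1 + x2 <= 1 cannot hold simultaneously
infeasible.lp <- lp(direction = "min", objective.in = c(5, 8),
                    const.mat = rbind(c(1, 1), c(1, 1)),
                    const.dir = c(">=", "<="),
                    const.rhs = c(2, 1))
infeasible.lp$status   # non-zero status (2) indicates no feasible solution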
Unboundedness
In some cases the objective and the constraints give rise to an unbounded solution:
$$\max_{x_1, x_2} \; 5x_1 + 8x_2$$
subject to:
$$x_1 + x_2 \ge 2, \quad x_1 + 2x_2 \ge 3, \quad x_1 \ge 0, \; x_2 \ge 0$$
Quadratic programming I
• Linear programming problems are a special case of optimization problems in which a possibly nonlinear function is
minimized subject to constraints.
• Such problems are typically more difficult to solve and are beyond the scope of this course; an exception is the case where
the objective function is quadratic and the constraints are linear.
• A quadratic programming problem with $m$ constraints is often of the form:
$$\min_x \; \tfrac{1}{2} x^\top D x - d^\top x$$
subject to the constraints $A^\top x \ge b$. Here $x$ is a vector of $p$ unknowns, $D$ is a positive definite $p \times p$ matrix, $d$ is a
vector of length $p$, $A$ is a $p \times m$ matrix, and $b$ is a vector of length $m$.
Quadratic programming II
In R the solve.QP() function of the quadprog package can be used to solve quadratic programs.
• Dmat - a matrix containing the elements of the matrix $D$ of the quadratic form in the objective function
• dvec - a vector containing the coefficients $d$ of the decision variables $x$ in the objective function
• Amat - the matrix $A$ of constraint coefficients; each column of the matrix corresponds to one constraint (the constraints
are read as $A^\top x \ge b$)
• bvec - a vector containing the constants on the right-hand side of the constraints
• meq - a number indicating the number of equality constraints. By default, this is 0. If it is not 0, the equality constraints
should be listed ahead of the inequality constraints.
Quadratic programming example I
• Assume we want to find out how much money to invest in a set of $n$ stocks. Let $x$ denote the vector of weights we want
to invest in the portfolio, so that the portfolio variance is $\sigma^2_{p,n} = x^\top \Sigma x$, where $\Sigma$ is the covariance matrix of the
returns of the stocks. Let $\mu$ denote the vector of average returns of the individual stocks.
• The problem is
$$\min_x \; x^\top \Sigma x - \mu^\top x$$
• We want $\sum_{i=1}^{n} x_i = 1$ (they are weights) and we do not allow short selling, i.e., $x_i \ge 0$.
• In the notation of solve.QP this corresponds to $D = 2\Sigma$ and $d = \mu$.
• The constraints can be specified as:
$$\underbrace{\begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{A^\top}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
\;\begin{matrix} = 1 \\ \ge 0 \\ \ge 0 \\ \ge 0 \end{matrix}$$
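The solve.QP() call itself is not reproduced here; a sketch, assuming Sigma (the 3 × 3 covariance matrix of the returns) and mu (the vector of mean returns) have already been computed:

library(quadprog)
Dmat <- 2 * Sigma
dvec <- mu
Amat <- cbind(rep(1, 3), diag(3))   # columns: sum(x) = 1, then x1, x2, x3 >= 0
bvec <- c(1, 0, 0, 0)
solve.QP(Dmat, dvec, Amat, bvec, meq = 1)   # first constraint is an equality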
$value
[1] -0.002021
$unconstrained.solution
[1] -0.02679 0.16071 0.47321
$iterations
[1] 2 0
$Lagrangian
[1] 0.003667 0.000000 0.000000 0.000000
$iact
[1] 1