Computerstatistik: Script

Laura Vana Gür

Last compiled on 20 January, 2022

Contents
Introduction

Data structures and subsetting in R

Data storage

Flow control

R functions

Basic statistics in R

Basic graphs with R

Basic data handling in R

Further R topics

Introduction to regression modeling in R

Computational linear algebra in R

Computational approaches to hypothesis testing

Numerical optimization and root finding in R


Disclaimer: This script is based on a collection of slides used in the Winter Semester 2021/2022 course.
Most of the materials have been adopted and adapted from the course developed by Klaus Nordhausen.

Introduction

What is R

• R was developed by Ross Ihaka and Robert Gentleman (the “R & R’s” of the University of Auckland).

• Ihaka, R., Gentleman, R. (1996): R: A language for data analysis and graphics, Journal of Computa-
tional and Graphical Statistics, 5, 299-314.
• R is an environment and language for data manipulation, calculation and graphical display.
• R is a GNU program. This means it is an open source program (as e.g. Linux) and is distributed for
free.
• R is used by more than 2 million users worldwide (according to R Consortium).
• R was originally used by the academic community but it is currently also used by companies like
Google, Pfizer, Microsoft, Bank of America, ...

R communities

• R has local communities worldwide for users to share ideas and learn.
• R events are organized all over the world bringing its users together:
– Conferences (e.g. useR!, WhyR?, eRum)
– R meetups: check out meetup.com

R and related languages

• R can be seen as an implementation or dialect of the S language, which was developed at the AT & T
Bell Laboratories by Rick Becker, John Chambers and Allan Wilks.
• The commercial version of S is S-Plus.
• Most programs written in S run unaltered in R, however there are differences.
• Code written in C, C++ or FORTRAN can be run by R too. This is especially useful for
computationally-intensive tasks.

How to get R

• R is available for most operating systems, e.g. Unix, Windows, Mac and Linux.
• R can be downloaded from the R homepage http://www.r-project.org

• The R homepage contains besides the download links also information about the R Project and the R
Foundation, as well as a documentation section and links to projects related to R.
• R is available in 32-bit and 64-bit versions.
• R normally comes with 14 base packages and 15 recommended packages.

CRAN

• CRAN stands for Comprehensive R Archive Network


• CRAN is a server network that hosts the basic distribution and R add-on packages

• Central server: http://cran.r-project.org

– New R versions are usually released every few weeks.


– current R version: 4.1.2 (Bird Hippie, released on 2021-11-01) as of 2022-01-20

• The R version used in the course is 4.1.2 (as of Winter semester 2021/2022).

R extension packages

• R can be easily extended with further packages, most of which can be downloaded from CRAN as well.
Installation and updating of those packages is also possible from within R itself (18420 packages are
currently available on CRAN).
• Packages for the analysis and comprehension of genomic data can be downloaded from the Bioconductor
pages (http://www.bioconductor.org).
• But R packages are also available from many other sources like R-Forge, GitHub, ...

Other distributions of R

• As R is open source and published under a GNU license, one can also build one's own version of R and
distribute it.
• For example Microsoft has Microsoft R Open https://mran.microsoft.com/open
• But there are many others too. Here, however, we use the standard R version from CRAN.

What R offers

Among other things R offers:

• an effective data handling and storage facility.


• a suite of operators for calculations on arrays and matrices (R is a vector based language).
• a large, coherent, integrated collection of tools for data analysis.
• graphical facilities for data analysis and display.
• powerful tools for communicating results. R packages make it easy to produce html or pdf reports, or
create interactive websites.
• a well-developed, simple and effective programming language.

Therefore R is not only a plain statistics software package, but it can be used as one. Most of the standard
statistics and a lot of the latest methodology is available for R.

R screenshot

R console

• R by default has no graphical interface and the so-called Console has to be used instead.
• The Console or Command Line Window is the window of R in which one writes the commands and in
which the (non-graphic) output will be shown.
• Commands can be entered after the prompt (>).
• In one row one normally types one command (enter submits the command). If one wants to put more
commands in one row, the commands have to be separated by a “;”.
• When a command line starts with a “+” instead of “>” it means that the last submitted command was
not completed and one should finish it now.
• All submitted commands of a session can be recalled with the up and down arrows ↑↓.

R as a pocket calculator

In the console we can for example do basic calculations

> 7 + 11
[1] 18
> 57 - 12
[1] 45
> 12 / 3
[1] 4
> 5 * 4
[1] 20
> 2 ^ 4
[1] 16
> sin(4)
[1] -0.7568025

R editors and IDEs

• Using the R Console can be quite cumbersome, especially for larger projects. An alternative to the
Command Line Window is the usage of editors or IDEs (integrated development environments).
• Editors are stand-alone applications that can be connected to an installed R version and are used for
editing R source code. The commands are typed and via the menu or key combinations the commands
are submitted. The user here usually has the choice to submit one command at a time or several
commands at once.
• IDEs integrate various development tools (editors, compilers, debuggers, etc.) into a single program -
the user does not have to worry about connecting the individual components
• R includes only a very basic editor which can be started from the menu File -> New script.

• Better editors are EMACS together with ESS, Tinn-R or WinEdt together with R-WinEdt.
These editors offer syntax highlighting and sometimes also templates for certain R structures.
• The most popular IDE is currently probably RStudio.

RStudio screenshot

RStudio default view

• The main window in RStudio contains five parts: one Menu and four Windows (“Panes”)
• From the drop-down menu RStudio and R can be controlled.
• Pane 1 (top left) - Files and Data: Editing R-Code and view of data sets
• Pane 2 (top right) - Workspace and History:

– Workspace lists all objects in the workspace


– History shows the complete code that was typed or executed in the console.

• Pane 3 (bottom right) - Files, Plots, Packages, Help:

– Files, to manage files


– Plots, to visualise and export graphics
– Packages, to manage extension packages
– Help, to access information and help pages for R functions and datasets

• Pane 4 (bottom left) - Console: Execution of R-Code


• This pane layout (and the pane contents) can be adapted using the options menu.

R: a short statistical example

A more sophisticated example than the previous one will demonstrate some features of R which will be
explained in detail later in the course.

> options(digits = 4)
> # setting random seed to get a reproducible example
> set.seed(1)
> # creating data
> eps <- rnorm(100, 0, 0.5)
> eps[1:5]
[1] -0.31323 0.09182 -0.41781 0.79764 0.16475
> group <- factor(rep(1:3, c(30, 40, 30)),
+ labels = c("group 1", "group 2", "group 3"))
> x <- runif(100, 20, 30)
> y <- 3 * x + 4 * as.numeric(group) + eps

> # putting the variables into a dataset as could be


> # observed in reality
> data.ex <- data.frame(y = y, x = x, group = group)
> # looking at the data
> str(data.ex)
'data.frame': 100 obs. of 3 variables:
$ y : num 71.7 70.7 79.1 72.9 69.6 ...
$ x : num 22.7 22.2 25.2 22.7 21.8 ...
$ group: Factor w/ 3 levels "group 1","group 2",..: 1 1 1 1 1 1 1 1 1 1 ...

Summary of the data can be obtained:

> summary(data.ex)
y x group
Min. : 64.7 Min. :20.3 group 1:30
1st Qu.: 74.3 1st Qu.:21.9 group 2:40
Median : 79.5 Median :23.8 group 3:30
Mean : 81.1 Mean :24.4
3rd Qu.: 86.8 3rd Qu.:26.4
Max. :102.0 Max. :29.8

Now some plots:

> plot(data.ex) # plot 1

[Plot 1: scatterplot matrix of the variables y, x and group in data.ex]
> plot(y ~ group) # plot 2


[Plot 2: boxplots of y for each group]
Build a linear model:

> # fitting a linear model and looking at it


> lm.fit <- lm(y ~ x + group)
> lm.fit

Call:
lm(formula = y ~ x + group)

Coefficients:
(Intercept) x groupgroup 2 groupgroup 3
3.77 3.01 4.07 7.98

> # more detailed output


> summary(lm.fit)

Call:
lm(formula = y ~ x + group)

Residuals:
Min 1Q Median 3Q Max
-1.1988 -0.2797 0.0198 0.2792 1.0893

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.7682 0.4288 8.79 6e-14 ***
x 3.0110 0.0169 178.19 <2e-16 ***
groupgroup 2 4.0666 0.1094 37.18 <2e-16 ***
groupgroup 3 7.9754 0.1201 66.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.453 on 96 degrees of freedom


Multiple R-squared: 0.997, Adjusted R-squared: 0.997
F-statistic: 1.1e+04 on 3 and 96 DF, p-value: <2e-16

Check some diagnostic plots:

> # diagnostic plots


> par(mfrow = c(2, 2)); plot(lm.fit)

[Diagnostic plots for lm.fit: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

What can we notice from the example?

1. Doing statistics with R is mainly calling readily available functions


2. The main assignment operator in R is “<-”
3. Assigning results or values produces no output
4. Results can be seen by calling the object or by not assigning the result of a function call
5. R is object oriented; this means that depending on the type of input, functions perform different tasks
(e.g. the functions plot() or summary())
6. Text or commands after “#” are ignored by R (can be used for commenting code)

Help for using R

At first sight R looks a bit difficult, but statistical analyses can already be done with a few basic commands.
To learn about those commands several sources are available:

• Online manuals and tutorials


• Books
• R’s built-in help system and its example and demo features
• R has its own journal, the R Journal (earlier called R Newsletter), where different topics are explained.
– For example, how to handle date formats in R is covered in Newsletter 4(1), 2004, pp. 29-32.
• Add-on packages are often described in journal articles, usually published in the free online journal
Journal of Statistical Software.

Manuals and Tutorials for R

• On the R homepage one can find the official manuals under Documentation -> Manuals. Especially
the “An Introduction to R” Manual is recommended.
• “Unofficial” tutorials and manuals, also in languages other than English, can be found on the R
homepage under Documentation -> Other or on CRAN under Documentation -> Contributed. Very
useful there is the R reference card by Tom Short.

R Tutorials for SAS, Stata or SPSS users

Many new R users are familiar with SAS, Stata and/or SPSS. For them, special charts giving an overview of
how things they used to do in SAS, Stata or SPSS can be done in R, as well as extended manuals for an
easier move to R, are available.
The following references might then be helpful:

• http://r4stats.com
• Muenchen, R.A. (2008): R for SAS and SPSS Users
• Muenchen, R.A. and Hilbe, J. (2010): R for Stata Users

Help within R

• There are three types of help available in R. They can be accessed via the menu or the command
line. Here only the command line versions will be explained.
• Using an internet browser:
> help.start() will invoke an internet browser with links to manuals, FAQs and the help pages of all
functions sorted by package, together with a search engine.
• The help command:
> help(command) will show the help page for command. A shorter version that does the same is > ?command. For a
few special commands the help works only when the command is quoted, e.g. > help("if")
• The help.search command:
> help.search("keyword") searches all titles and aliases of the help files for keywords. A
shorter version that does the same is > ??keyword. This is however not a full text search.

There are also three other functions useful to learn about functions.

• apropos: apropos("string") searches all functions that have the string in their function name
• demo: The demo function runs some available scripts to demonstrate their usage. To see which topics
have a demo script submit > demo()
• example: > example(topic) runs all example codes from the help files that belong to the topic topic
or use the function topic.
• Also, in case you remember only the beginning of a function name or are just lazy: R has an auto-completion
feature. If you start typing a command and hit tab, R will complete the command if there are no
alternatives or will give you all the alternatives.
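A few quick examples of these help facilities (the function mean is used here merely as an arbitrary example):

help(mean)       # opens the help page of mean()
?mean            # shorthand for help(mean)
help("if")       # special commands have to be quoted
??regression     # shorthand for help.search("regression")
apropos("mean")  # all functions with "mean" in their name
example(mean)    # runs the examples from the help page of mean()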

Mailing lists for R

• R as one of the main statistical software programs has several mailing lists. There are general mailing
lists or lists of special interest groups like a list for mixed effects models or robust statistics (for details
see the R homepage).
• The general mailing list is R-help where questions are normally answered pretty quickly. But make
sure to read the posting guide before you ask something yourself! The R-help mails are also archived
and can be searched.
• The search link on the R homepage leads to more information on search resources.
• And last but not least, there is also Stack Overflow.

R Markdown

• Mixture of Markdown, a markup language for writing documents in plain text, and “chunks” of code
in R or another programming language.
• Then the input is rendered into a document (aka knitted), R runs the code, automatically collects
printed output and graphics and inserts them in the final document.
• In RStudio it can be created using File -> New File -> R Markdown. A window pops up where you
can choose among different types of output. Once this is chosen (e.g., a pdf document) a new file will
open with a template.

• The first part of the template is called YAML (Yet Another Markup Language) and contains informa-
tion that will be used when rendering your document.
• The actual document starts after the YAML preamble.

• More information can be found on the RStudio page in TUWEL.
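For orientation, a minimal R Markdown file could look roughly as follows (title, output format and chunk
contents are of course arbitrary):

---
title: "A first report"
output: pdf_document
---

# A section heading

Some text written in Markdown.

```{r}
# an R chunk; its printed output and graphics are inserted into the rendered document
summary(cars)
```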

Data structures and subsetting in R

Basic data structures in R

The five most used data structures in R can be categorized using their dimensionality and whether all content
must be of the same type, i.e. if they are homogeneous or heterogeneous.

     Homogeneous   Heterogeneous
1D   vector        list
2D   matrix        data frame
nD   array

Scalars are treated as vectors of length 1, and almost all other types of objects in
R are built upon these five structures.
To understand the structure of an object in R, the best way is to use

str(object)

Vectors in R

The most basic structure is a vector. Vectors come in two different flavors:

• atomic vector
• list

Every vector has three properties:

• of what type it is (typeof)


• how long it is (length)
• which attributes it has (attributes)

Difference of an atomic vector and a list

• In an atomic vector all elements must be of the same type, whereas in the list the different elements
can be of different types.
• There are four common types for an atomic vector:

– logical
– integer
– double (often referred to as numeric)
– character

• The basic function to create atomic vectors is c.

The function c()

The most direct way to create a vector is the c function, where all values can be entered. The values are
then concatenated.

object.x <- c(value1, value2, ...)

A single number is also treated like a vector, but it can be assigned to an object more directly:

object.x <- value

Examples of atomic vectors

LogVector <- c(TRUE, FALSE, FALSE, TRUE)


IntVector <- c(1L, 2L, 3L, 4L)
DouVector <- c(1.0, 2.0, 3, 4)
ChaVector <- c("a", "b", "c", "d")

LogVector
[1] TRUE FALSE FALSE TRUE
IntVector
[1] 1 2 3 4

DouVector
[1] 1 2 3 4
ChaVector
[1] "a" "b" "c" "d"

Missing values and other special values in R

Missing values in R are specified as NA which is internally a logical vector of length 1.

• If used within c() NA will always be coerced to the correct type of the vector.
• To create NAs of a specific type one can use NA_real_, NA_integer_ or NA_character_.

Inf is infinity. You can have either positive or negative infinity.

> 1 / 0
[1] Inf

NaN means Not a Number. It’s an undefined value.

> 0 / 0
[1] NaN
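A short illustration of how NA behaves in practice (the vector x is made up for this purpose):

x <- c(1, NA, 3)
is.na(x)              # elementwise check for missing values
[1] FALSE  TRUE FALSE
typeof(c(1.5, NA))    # within c(), NA is coerced to the type of the vector
[1] "double"
mean(x)               # many functions return NA if the data contain NA
[1] NA
mean(x, na.rm = TRUE) # ... unless told to remove the missing values
[1] 2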

Checking for the types of a vector

Given a vector it is easy to check which type it is.


The basic function is typeof.
To check for a specific type the “is”-functions can be used:

• is.character
• is.double
• is.integer
• is.logical
• is.atomic

Checking the example vectors

typeof(IntVector)
[1] "integer"
typeof(DouVector)
[1] "double"
is.atomic(IntVector)
[1] TRUE
is.character(IntVector)
[1] FALSE
is.double(IntVector)
[1] FALSE
is.integer(IntVector)

[1] TRUE
is.logical(IntVector)
[1] FALSE

The function is.numeric checks if a vector is of type double or integer.

is.numeric(LogVector)
[1] FALSE
is.numeric(IntVector)
[1] TRUE
is.numeric(DouVector)
[1] TRUE
is.numeric(ChaVector)
[1] FALSE

More on data types in R

R has 6 basic data types (the ones shown below + a raw data type used to hold raw bytes).

> x <- "a"


> typeof(x)
[1] "character"

> y <- 1.5


> typeof(y)
[1] "double"

> z <- 1L
> typeof(z)
[1] "integer"

> w <- TRUE


> typeof(w)
[1] "logical"

> k <- 2 + 4i
> typeof(k)
[1] "complex"

Usually, data vectors are not entered by hand in R, but read in as data saved in some other format.
However, vectors with a regular structure are often needed, and the following slides give some useful functions to create
such vectors.

Sequences

To create a vector that has a certain start and ending point and is filled with points that have equal steps
between them, the function seq can be used.

x <- seq(from = 0, to = 1, by = 0.2)
x
[1] 0.0 0.2 0.4 0.6 0.8 1.0
y <- seq(length = 6, from = 0, to = 1)
y
[1] 0.0 0.2 0.4 0.6 0.8 1.0
z <- 1:5
z
[1] 1 2 3 4 5

Replications

The function rep can be used to replicate objects in several ways. For details see the help of the function.
Here are some examples

x <- rep(c(2, 1), 3)


x
[1] 2 1 2 1 2 1
y <- rep(c(2, 1), each = 3)
y
[1] 2 2 2 1 1 1
z <- rep(c(2, 1), c(3, 5))
z
[1] 2 2 2 1 1 1 1 1

Vectors with random pattern

The sample function allows us to obtain a random sample of a specified size from the elements given in
a vector. The following code corresponds to the results of eight rolls of a 6-sided die:

sample(1:6, size = 8, replace = TRUE)

[1] 1 1 3 1 1 6 6 6

Logical operators in R

Logical vectors are usually created by using logical expressions. The logical vector is of the same length as
the original vector and gives elementwise the result for the evaluation of the expression.
The logical operators in R are:

Operator   Meaning
==         equal to
!=         not equal to
<          less than
>          greater than
>=         greater than or equal to
<=         less than or equal to
Two logical expressions L1 and L2 can be combined using:

L1 & L2 for L1 and L2
L1 | L2 for L1 or L2
!L1 for the negation of L1

Logical vectors are typically created in the following way:

age <- c(42, 45, 67, 55, 37, 73, 77)


older50 <- age > 50
older50
[1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE

When one wants to enter a logical vector, TRUE can be abbreviated with T and FALSE with F; this is however
not recommended.
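Using the age vector from above, the combination operators can be illustrated as follows:

age > 50 & age < 70      # between 50 and 70
[1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
!older50                 # negation of the logical vector created above
[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE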

Vector arithmetic

With numeric vectors one normally wants to perform calculations.


When using arithmetic expressions they are usually applied to each element of the vector.
If an expression involves two or more vectors these vectors do not have to be of the same length, the shorter
vectors will be recycled until they have as many elements as the longest vector.
Important expressions here are:

+, -, *, /, ^, log, sin, cos, tan, sqrt,


min, max, range, mean, var

Here is a short example for vector arithmetic and the recycling of vectors

x <- 1:4
x
[1] 1 2 3 4
y <- rep(c(1,2), c(2,4))
y
[1] 1 1 2 2 2 2
x ^ 2
[1] 1 4 9 16
x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 2 3 5 6 3 4

Basic operations on character vectors

Taking substrings using substr (alternatively substring can be used, but it has slightly different arguments):

cols <- c("red", "blue", "magenta", "yellow")


substr(cols, start = 1, stop = 3)

[1] "red" "blu" "mag" "yel"

Building up strings by concatenation within elements using paste:

paste(cols, "flowers")

[1] "red flowers" "blue flowers" "magenta flowers" "yellow flowers"

paste(cols, "flowers", sep = "_")

[1] "red_flowers" "blue_flowers" "magenta_flowers" "yellow_flowers"

paste(cols, "flowers", collapse = ", ")

[1] "red flowers, blue flowers, magenta flowers, yellow flowers"

Coercion

• As all elements in an atomic vector must be of the same type it is of course of interest what happens
if they aren’t.
• In that case the different elements will be coerced to the most flexible type.
• The most flexible type is usually character. But for example a logical vector can be coerced to an
integer or double vector where TRUE becomes 1 and FALSE a 0.
• Coercion order: logical -> integer -> double -> (complex) -> character

v1 <- c(1, 2L)


typeof(v1)
[1] "double"
v2 <- c(v1, "a")
typeof(v2)
[1] "character"
v3 <- c(2L, TRUE, TRUE, FALSE)
typeof(v3)
[1] "integer"

• Coercion often happens automatically. Most mathematical functions try to coerce vectors to numeric
vectors. And on the other hand, logical operators try to coerce to a logical vector.
• In most cases if coercion does not work, a warning or error message is returned.
• In programming, to avoid coercion to a possibly wrong type, coercion is forced using the “as”-
functions like as.character, as.double, as.numeric, ...

LogVector
[1] TRUE FALSE FALSE TRUE
sum(ChaVector)
Error in sum(ChaVector): invalid 'type' (character) of argument
as.numeric(LogVector)
[1] 1 0 0 1
ChaVector2 <- c("0", "1", "7")
as.integer(ChaVector2)
[1] 0 1 7

ChaVector3 <- c("0", "1", "7", "b")
as.integer(ChaVector3)
Warning: NAs introduced by coercion
[1] 0 1 7 NA

Lists
Lists are different from atomic vectors as their elements do not have to be of the same type.
To construct a list one usually uses list.

List1 <- list(INT = 1L:3L,


LOG = c(FALSE, TRUE),
DOU = DouVector,
CHA = "z")
str(List1)
List of 4
$ INT: int [1:3] 1 2 3
$ LOG: logi [1:2] FALSE TRUE
$ DOU: num [1:4] 1 2 3 4
$ CHA: chr "z"

The number of components of a list can be obtained using length.

length(List1)
[1] 4

To initialize a list with a certain number of components vector can be used.

List2 <- vector("list", 2)


List2
[[1]]
NULL

[[2]]
NULL

Combining Lists
• Several lists can be combined into one list using c.
• If a combination of lists and atomic vectors is given to c then the function will first coerce each atomic
vector to lists before combining them.

List3 <- c(List1, list(new = 7:10, new2 = c("G", "H")))


str(List3)
List of 6
$ INT : int [1:3] 1 2 3
$ LOG : logi [1:2] FALSE TRUE
$ DOU : num [1:4] 1 2 3 4
$ CHA : chr "z"
$ new : int [1:4] 7 8 9 10
$ new2: chr [1:2] "G" "H"

List4 <- list(a = 1, b = 2)
Vec1 <- 3:4
Vec2 <- c(5.0, 6.0)
List5 <- c(List4, Vec1, Vec2)
List6 <- list(List4, Vec1, Vec2)

str(List5)
List of 6
$ a: num 1
$ b: num 2
$ : int 3
$ : int 4
$ : num 5
$ : num 6
str(List6)
List of 3
$ :List of 2
..$ a: num 1
..$ b: num 2
$ : int [1:2] 3 4
$ : num [1:2] 5 6

More on lists

• The typeof of a list is "list".


• is.list can be used to check if an object is a list.
• as.list can be used to coerce to a list.
• To convert a list to an atomic vector unlist can be used. It uses the same coercion rules as c.
• From many statistical functions which return more complicated data structures the results are actually
lists.
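A brief sketch of these functions (List1 is the list created above; the small list passed to unlist is made up
for illustration):

is.list(List1)
[1] TRUE
typeof(List1)
[1] "list"
unlist(list(a = 1:2, b = c(2.5, 3.5)))  # coerced following the same rules as c()
 a1  a2  b1  b2
1.0 2.0 2.5 3.5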

Attributes

• All objects in R can have additional attributes to store metadata about the object. The number of
attributes is basically not limited, and the set of attributes can be thought of as a named list with unique
component names.
• Individual attributes can be accessed using the function attr or all at once using the function
attributes.

Attributes examples

VecX <- 1:5


attr(VecX, "attribute1") <- "I'm a vector"
attr(VecX, "attribute2") <- mean(VecX)
attr(VecX, "attribute1")
[1] "I'm a vector"
attr(VecX, "attribute2")
[1] 3
attributes(VecX)

$attribute1
[1] "I'm a vector"

$attribute2
[1] 3
typeof(attributes(VecX))
[1] "list"

Special attributes in R

In R, three attributes play a special role. We will come back to them later in more detail and just mention
them briefly now:

• names: the names attribute is a character vector giving each element a name. This will be discussed
soon.
• dimension: the dim (for dimension) attribute turns vectors into matrices and arrays.
• class: the class attribute is very important in the context of S3 classes discussed later.

Attributes when the object is manipulated

• Depending on the function used, attributes might or might not get lost.
• The three special attributes mentioned earlier have special roles and are usually not lost; many other
attributes, however, are often lost.

attributes(5 * VecX - 7)
$attribute1
[1] "I'm a vector"

$attribute2
[1] 3
attributes(sum(VecX))
NULL
attributes(mean(VecX))
NULL

The names attribute

There are three different ways to name a vector:

1. Directly at creation:

Nvec1 <- c(a = 1, b = 2, c = 3)


Nvec1
a b c
1 2 3

2. By modifying an existing vector in place:

Nvec2 <- 1:3
Nvec2
[1] 1 2 3
names(Nvec2) <- c("a", "b", "c")
Nvec2
a b c
1 2 3

3. By creating a modified copy:

Nvec3 <- setNames(1:3, c("a", "b", "c"))


Nvec3
a b c
1 2 3

Properties of names

• Names do not have to be unique


• Not all elements need names. If no element has a name, the names attribute is NULL. If some
elements have names but others do not, then the missing elements get an empty string as name.
• Names are usually the most useful if all elements have a name and if the names are all unique.
• Name attributes can be removed by assigning names(object) <- NULL.

names(c(a = 1, 2, 3))
[1] "a" "" ""
names(1:3)
NULL

Factors

Categorical data is an important data type in statistics - in R they are usually represented by factors.
A factor in R is basically an integer vector with two attributes:

1. The class attribute which has the value factor and which makes it behave differently compared to
standard integer values.
2. The levels attribute which specifies the set of admissible values the vector can take.

A factor is usually created with the function factor.

Factors demo

Fac1 <- factor(c("green", "green", "blue"))


Fac1
[1] green green blue
Levels: blue green
class(Fac1)

[1] "factor"
levels(Fac1)
[1] "blue" "green"

Fac1[2] <- "red"


Warning in `[<-.factor`(`*tmp*`, 2, value = "red"): invalid factor level, NA
generated

Fac1 <- factor(c("green", "green", "blue"))


Fac2 <- factor(c("green", "blue", "blue"))
Fac3 <- factor(c("green", "green"))
Fac4 <- c(Fac1, Fac2)
Fac4
[1] green green blue green blue blue
Levels: blue green
Fac5 <- c(Fac1, Fac3)
Fac5
[1] green green blue green green
Levels: blue green

Levels of a factor

Hence all possible values of a factor should be specified, even when not all of them appear in the observed
vector. This will also often be more informative when analyzing data.

SexCha <- c("male", "male", "male")


SexFac <- factor(SexCha, levels = c("male", "female"))

table(SexCha)
SexCha
male
3
table(SexFac)
SexFac
male female
3 0

The function relevel

• In statistics often one group is used as a reference group and all other groups are compared to this
group.
• To achieve this in R the reference group should be the first level of a factor.
• To change the order of the levels, the function relevel should be used.

Examples relating to factors

treat <- factor(rep(c(1, 3), c(2, 4)),


labels = c("DRUG2", "PLACEBO"))

treat
[1] DRUG2 DRUG2 PLACEBO PLACEBO PLACEBO PLACEBO
Levels: DRUG2 PLACEBO
treat2 <- factor(rep(c(1, 3), c(2, 4)), levels = 1:3,
labels = c("DRUG2", "DRUG1", "PLACEBO"))
treat2
[1] DRUG2 DRUG2 PLACEBO PLACEBO PLACEBO PLACEBO
Levels: DRUG2 DRUG1 PLACEBO
treat3 <- relevel(treat2, ref = "PLACEBO")
treat3
[1] DRUG2 DRUG2 PLACEBO PLACEBO PLACEBO PLACEBO
Levels: PLACEBO DRUG2 DRUG1

Categorizing a numeric vector

Often one observes numeric values for a variable and wants to categorize it according to its value. This
can easily be done using the function cut.

BMI <- round(rnorm(8, 20, 8), 2)


BMI
[1] 17.98 13.07 24.66 19.90 17.00 22.54 16.09 41.27
BMI.cat.1 <- cut(BMI, c(0, 18.5, 25, 30, 100))
BMI.cat.1
[1] (0,18.5] (0,18.5] (18.5,25] (18.5,25] (0,18.5] (18.5,25] (0,18.5]
[8] (30,100]
Levels: (0,18.5] (18.5,25] (25,30] (30,100]
BMI.cat.2 <- cut(BMI, c(0, 18.5, 25, 30, 100),
labels = c("low", "normal", "heavy", "obese"))
BMI.cat.2
[1] low low normal normal low normal low obese
Levels: low normal heavy obese

Arrays and matrices

• Adding a dim attribute to an atomic vector allows it to behave like a multidimensional array.
• A special case of an array is a matrix - there the dimension attribute is of length 2.

• While matrices are an essential part of statistics, arrays are much rarer but are still useful.
• Usually matrices and arrays are not created by modifying atomic vectors but by using the functions
matrix and array.

Examples of matrices and arrays

M1 <- matrix(1:6, ncol = 3, nrow = 2)


M1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6

A1 <- array(1:24, dim = c(3, 4, 2))
A1
, , 1

[,1] [,2] [,3] [,4]


[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12

, , 2

[,1] [,2] [,3] [,4]


[1,] 13 16 19 22
[2,] 14 17 20 23
[3,] 15 18 21 24

Length and names for matrices

• Naturally, also the “length” attribute of a matrix is then two-dimensional. The corresponding functions
are ncol and nrow.

• Similarly, “names” has the two versions colnames and rownames.

ncol(M1)
[1] 3
nrow(M1)
[1] 2
colnames(M1) <- LETTERS[1:3]
rownames(M1) <- letters[1:2]
M1
A B C
a 1 3 5
b 2 4 6
rownames(M1)
[1] "a" "b"
length(M1) ## number of elements in matrix!
[1] 6
c(M1) ## columns are appended into an atomic vector
[1] 1 2 3 4 5 6

Length and names for arrays

The counterpart of length for an array is dim, and the counterpart of names is dimnames, which is a list of
character vectors of appropriate lengths.

dim(A1)
[1] 3 4 2
dimnames(A1) <- list(c("r1", "r2", "r3"), c("c1", "c2", "c3", "c4"),
c("a1", "a2"))
A1
, , a1

c1 c2 c3 c4
r1 1 4 7 10
r2 2 5 8 11
r3 3 6 9 12

, , a2

c1 c2 c3 c4
r1 13 16 19 22
r2 14 17 20 23
r3 15 18 21 24

Useful functions in the context for matrices and arrays

• The extensions of c for matrices are cbind and rbind. Similarly the package abind provides the function
abind for arrays.
• For transposing a matrix in R the function t is available and for the array counterpart the function
aperm.
• To check if an object is a matrix / array the functions is.matrix / is.array can be used.
• Similarly coercion to matrices and arrays can be performed using as.matrix / as.array.
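A small sketch of some of these functions, applied to the named 2 x 3 matrix M1 from above:

t(M1)                # transpose: rows become columns
  a b
A 1 2
B 3 4
C 5 6
cbind(M1, D = 7:8)   # appending a further column
  A B C D
a 1 3 5 7
b 2 4 6 8
is.matrix(M1)
[1] TRUE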

Data frames

• Data frames are in R the most common structures to store data.


• Internally it is the same as a list of equal length vectors.
• It has however also a similar structure as a matrix.
• Hence it shares properties from both types.
• For example the function length returns for a data frame the number of list components, which is the
number of columns and hence the same as ncol, while nrow returns the number of rows.
• Following the same reasoning, names gives the names of the vectors which is the same as colnames.
rownames in turn gives the row names.

Data frame creation

The function data.frame can be used to create data frames. Since R 4.0.0 it does not by default convert
character vectors to factors anymore.

DF1 <- data.frame(V1 = 1:5,


V2 = c("a", "a", "b", "a", "d"))
str(DF1)
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 2 3 4 5
$ V2: chr "a" "a" "b" "a" ...

Most functions which read external data into R also return a data frame.

stringsAsFactors

Note that the argument stringsAsFactors = TRUE provides the old behaviour of automatic conversion.

DF2 <- data.frame(V1 = 1:5,


V2 = c("a", "a", "b", "a", "d"),
stringsAsFactors = TRUE)
str(DF2)
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 2 3 4 5
$ V2: Factor w/ 3 levels "a","b","d": 1 1 2 1 3

This can also be controlled globally by using

options(stringsAsFactors = TRUE)

More on data frames

Basically a data frame is a list with an S3 class attribute. So “checks” of a data frame yield:

typeof(DF1)
[1] "list"
class(DF1)
[1] "data.frame"
is.data.frame(DF1)
[1] TRUE

Coercion to data frames I

Lists, vectors and matrices can be coerced to data frames if it is appropriate. For lists this means that all
objects have the same “length”.

V1 <- 1:5
L1 <- list(V1 = V1, V2 = letters[c(1, 2, 3, 2, 1)])
L2 <- list(V1 = V1, V2 = letters[c(1, 2, 3, 2, 1, 3)])
str(as.data.frame(V1))
'data.frame': 5 obs. of 1 variable:
$ V1: int 1 2 3 4 5

Coercion to data frames II

str(as.data.frame(M1))
'data.frame': 2 obs. of 3 variables:
$ A: int 1 2
$ B: int 3 4
$ C: int 5 6
str(as.data.frame(L1))
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 2 3 4 5

$ V2: chr "a" "b" "c" "b" ...
str(as.data.frame(L2))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply dif

Combining data frames

• The basic functions to combine two data frames (works similar with matrices) are cbind and rbind.
• When combining column-wise, then the numbers of rows must match and row names are ignored (hence
observations need to be in the same order).
• When combining row-wise the number of columns and their names must match.
• For more advanced combining see the function merge.

Combining data frames examples

cbind(DF1, data.frame(new = 6:10))


V1 V2 new
1 1 a 6
2 2 a 7
3 3 b 8
4 4 a 9
5 5 d 10
rbind(DF1, data.frame(V1 = 1, V2 = "c"))
V1 V2
1 1 a
2 2 a
3 3 b
4 4 a
5 5 d
6 1 c

More about cbind

Note that cbind (and rbind) try to make matrices when possible. Only if at least one of the elements to be
combined is a data frame will the result also be a data.frame.
Hence vectors can’t usually be combined into a data frame using cbind.

V1 <- 1:3
V2 <- c("a", "b", "a")
str(cbind(V1, V2))
chr [1:3, 1:2] "1" "2" "3" "a" "b" "a"
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "V1" "V2"

Special columns in a data frame

• Since a data frame is list of vectors it is possible to have a list as a column.


• However, when a list is given to the data.frame function it usually fails as the function tries to put
each list item into its own column.
• A workaround is to use the function I which is a protector function and says something should be
treated as is.

• More common than adding a list is to add a matrix to a data frame - also here the protector
function I should be used.

Special columns in a data frame examples

DF4 <- data.frame(a = 1:3)

# works:
DF4$b <- list(1:2,1:3,1:4)
DF4
a b
1 1 1, 2
2 2 1, 2, 3
3 3 1, 2, 3, 4
# does not work
DF5 <- data.frame(a = 1:3, b = list(1:2,1:3,1:4))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply dif
# does work
DF6 <- data.frame(a = 1:3, b = I(list(1:2,1:3,1:4)))
DF6
a b
1 1 1, 2
2 2 1, 2, 3
3 3 1, 2, 3, 4

DF6 <- data.frame(a = 1:3, b = I(matrix(1:6, nrow = 3)))


DF6
a b.1 b.2
1 1 1 4
2 2 2 5
3 3 3 6
str(DF6)
'data.frame': 3 obs. of 2 variables:
$ a: int 1 2 3
$ b: 'AsIs' int [1:3, 1:2] 1 2 3 4 5 6

Subsetting data structures in R

Subsetting is a key feature for working with data. R is really flexible in this regard and has many different
ways to subset the different data structures.
In the following we will discuss the main ways for the main data structures.

Subsetting atomic vectors

We will start with subsetting atomic vectors, as subsetting other structures works quite similarly.
There are six ways to subset an atomic vector:

1. positive indexing using positive integers.


2. negative indexing using negative integers.
3. logical indexing using logical vectors.
4. named indexing using character vectors.
5. blank indexing.
6. zero indexing.

Positive indexing of atomic vectors

Specifying in square brackets the position of the elements which should be selected.

V1 <- c(1, 3, 2.5, 7.2, -3.2)


# basic version
V1[c(1, 3)]
[1] 1.0 2.5
# same elements can be selected multiple times
V1[c(1, 3, 1, 3, 1, 3, 1)]
[1] 1.0 2.5 1.0 2.5 1.0 2.5 1.0
# double valued indices are truncated to integers
V1[c(1.1, 3.9)]
[1] 1.0 2.5

Negative indexing of atomic vectors

Specifying in square brackets the positions of the elements which should not be selected.

V1[-c(2, 4, 5)]
[1] 1.0 2.5

Note that positive and negative indexing cannot be combined:

V1[c(-1, 2)]
Error in V1[c(-1, 2)]: only 0's may be mixed with negative subscripts

Logical indexing of atomic vectors

Giving in square brackets a logical vector of the same length means that the elements with value TRUE will
be selected.

# basic version
V1[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
[1] 1.0 2.5
# if the logical vector is too short,
# it will be recycled.

V1[c(TRUE, FALSE, TRUE)]
[1] 1.0 2.5 7.2
# most common is to use expression
# which return a logical vector
V1[V1 < 3]
[1] 1.0 2.5 -3.2

Named indexing of atomic vectors

Giving in square brackets a character vector of the names which should be selected.

names(V1) <- letters[1:5]


# basic version
V1[c("a", "c")]
a c
1.0 2.5
# same elements can be selected multiple times
V1[c("a", "c", "a", "c", "a", "c", "a")]
a c a c a c a
1.0 2.5 1.0 2.5 1.0 2.5 1.0
# names are matched exactly
V1[c("a", "c", "ab", "z")]
a c <NA> <NA>
1.0 2.5 NA NA

Blank and zero indexing of atomic vectors

Blank indexing is not useful for atomic vectors but will be relevant for higher dimensional objects. It
returns in this case the original atomic vector.

V1[]
a b c d e
1.0 3.0 2.5 7.2 -3.2

Zero indexing returns in this case a zero length vector. It is often used when generating testing data.

V1[0]
named numeric(0)

Indexing lists

Lists are in general subset quite like atomic vectors. There are however more operators available for subset-
ting:

1. [ ([ ])
2. [[ ([[ ]])
3. $

The first one always returns a list; the other two options extract list components (details will follow later).

L1 <- list(a = 1:2, b = letters[1:3], c = c(TRUE, FALSE))
L1[1]
$a
[1] 1 2
L1[[1]]
[1] 1 2
L1$a
[1] 1 2

Indexing matrices and arrays

Subsetting of higher dimensional objects can be done in three ways:

1. using multiple vectors.


2. using a single vector.
3. using matrices

The most common way is to generalize the atomic vector subsetting to higher dimensions by using one of the
six methods described earlier for each dimension.
Here especially the blank indexing becomes relevant.
We will focus here on matrices, but arrays work basically the same.

Subsetting matrices with two vectors

M1 <- matrix(1:6, ncol = 3)


rownames(M1) <- LETTERS[1:2]
colnames(M1) <- letters[1:3]

M1[c(TRUE, FALSE), c("b", "c")]


b c
3 5

M1[ ,c(1, 1, 2)]


a a b
A 1 1 3
B 2 2 4

M1[-2, ]
a b c
1 3 5

Subsetting matrices with one vector

As matrices (arrays) are essentially vectors with a dimension attribute, a single vector can also be used to
extract elements. For this it is important to note that matrices (arrays) are filled in column-major order.

M2 <- outer(1:5, 1:5, paste, sep = ",")
M2
[,1] [,2] [,3] [,4] [,5]
[1,] "1,1" "1,2" "1,3" "1,4" "1,5"
[2,] "2,1" "2,2" "2,3" "2,4" "2,5"
[3,] "3,1" "3,2" "3,3" "3,4" "3,5"
[4,] "4,1" "4,2" "4,3" "4,4" "4,5"
[5,] "5,1" "5,2" "5,3" "5,4" "5,5"

M2[c(3, 17)]
[1] "3,1" "2,4"

Subsetting matrices with a matrix

This is rarely done but possible. To select elements from an n-dimensional object, a matrix with n columns
can be used. Each row of the matrix specifies one element. The result will always be a vector. The matrix
can consist of integers or of characters (if the array is named).

M3 <- matrix(ncol = 2, byrow = TRUE,


data = c(1, 4,
3, 3,
5, 1))
M2[M3]
[1] "1,4" "3,3" "5,1"

Subsetting data frames

Recall that data frames are on the one side lists and on the other side similar to matrices.
If a data frame is subset with a single vector it behaves like a list. If subset with two vectors it behaves like
a matrix.

DF1 <- data.frame(a = 4:6, b = 7:5, c = letters[15:17])


DF1[DF1$a <= 5, ]
a b c
1 4 7 o
2 5 6 p
DF1[c(1,3), ]
a b c
1 4 7 o
3 6 5 q

Subsetting data frames II

To select columns:

# like a matrix
DF1[, c("a","c")]
a c
1 4 o

2 5 p
3 6 q
# like a list
DF1[c("a","c")]
a c
1 4 o
2 5 p
3 6 q

Subsetting data frames III

The behavior differs, if only one column is selected:

# like a matrix
DF1[, "a"]
[1] 4 5 6
# like a list
DF1[ "a"]
a
1 4
2 5
3 6

Subsetting arbitrary S3 objects

• In general S3 objects consist of atomic vectors, matrices, arrays, lists and so on, and these components can be
extracted from the S3 object in the same ways as described above.

• Again, the initial step is to look at str to reveal the details of the object.

Subsetting an S3 object example I

set.seed(1)
x <- runif(1:100)
y <- 3 + 0.5 * x + rnorm(100, sd = 0.1)
fit1 <- lm(y ~ x)
class(fit1)
[1] "lm"

Assume we want to extract individually the three parameters of the model.

Subsetting an S3 object example II

str(fit1)
List of 12
$ coefficients : Named num [1:2] 2.982 0.531
..- attr(*, "names")= chr [1:2] "(Intercept)" "x"

$ residuals : Named num [1:100] 0.0495 -0.0549 0.0342 -0.1234 0.1549 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ effects : Named num [1:100] -32.572 1.414 0.028 -0.137 0.157 ...
..- attr(*, "names")= chr [1:100] "(Intercept)" "x" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:100] 3.12 3.18 3.29 3.46 3.09 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:100, 1:2] -10 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:100] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "x"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.1 1.05
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 98
$ xlevels : Named list()
$ call : language lm(formula = y ~ x)
$ terms :Classes 'terms', 'formula' language y ~ x
.. ..- attr(*, "variables")= language list(y, x)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "y" "x"
.. .. .. ..$ : chr "x"
.. ..- attr(*, "term.labels")= chr "x"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(y, x)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "y" "x"
$ model :'data.frame': 100 obs. of 2 variables:
..$ y: num [1:100] 3.17 3.12 3.32 3.34 3.24 ...
..$ x: num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ x
.. .. ..- attr(*, "variables")= language list(y, x)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "y" "x"
.. .. .. .. ..$ : chr "x"
.. .. ..- attr(*, "term.labels")= chr "x"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(y, x)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"

- attr(*, "class")= chr "lm"

Subsetting an S3 object example III

# the intercept
fit1$coefficients[1]
(Intercept)
2.982
# the slope
fit1$coefficients[2]
x
0.5312
# sigma needs to be computed
sqrt(sum((fit1$residuals - mean(fit1$residuals))^2)
/fit1$df.residual)
[1] 0.09411

Subsetting an S3 object example IV

This computation is actually done by summary.lm:

summary(fit1)

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-0.18498 -0.05622 -0.00871 0.05243 0.25166

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9821 0.0206 145 <2e-16 ***
x 0.5312 0.0353 15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0941 on 98 degrees of freedom


Multiple R-squared: 0.697, Adjusted R-squared: 0.694
F-statistic: 226 on 1 and 98 DF, p-value: <2e-16

Subsetting arbitrary S4 objects

S4 objects have their own subsetting operators.

• the $ operator is replaced by @.


• the [[ operator is replaced by the function slot.

These operators are much more restrictive than their standard counterparts.
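A minimal sketch of what this looks like in practice (the class person and its slots are made up purely for
illustration):

setClass("person", slots = c(name = "character", age = "numeric"))
p <- new("person", name = "Anna", age = 30)
p@name            # @ replaces $ for S4 objects
[1] "Anna"
slot(p, "age")    # slot() replaces [[
[1] 30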

More on standard subsetting operators

We have already used the operators [[ and $, which are frequently used when extracting parts from lists and other
objects.

• [[ is similar to [, but it can only extract a single value/component. Hence only a single positive integer or a
single string can be used in combination with [[.
• $ is a shorthand for [[ when the component is named.

These operators are mainly used in the context of lists, and the difference is that [ always returns a list while
[[ gives the content of the list component.

Examples for [[ and $ I

str(L1)
List of 3
$ a: int [1:2] 1 2
$ b: chr [1:3] "a" "b" "c"
$ c: logi [1:2] TRUE FALSE
L1[[1]]
[1] 1 2
L1[1]
$a
[1] 1 2
L1$a
[1] 1 2
str(L1[[1]])
int [1:2] 1 2
str(L1[1])
List of 1
$ a: int [1:2] 1 2
str(L1$a)
int [1:2] 1 2

Examples for [[ and $ II

If [[ is used with a vector of integers or characters then it assumes a nested list structure.

L2 <- list(a = list(A = list(aA = 1:3, bB = 4:6),


B = "this"),
b = "that")
str(L2)
List of 2
$ a:List of 2
..$ A:List of 2
.. ..$ aA: int [1:3] 1 2 3
.. ..$ bB: int [1:3] 4 5 6
..$ B: chr "this"
$ b: chr "that"
L2[[c("a","A","aA")]]
[1] 1 2 3

Simplification vs preservation

As the different subsetting operators have different properties, the distinction between simplification and
preservation needs to be kept in mind at all times, as it can have a huge impact in programming.
When in doubt it is usually better not to simplify, as it is then guaranteed that an object stays of the type it
was originally.
To prevent or force simplification, the argument drop can be specified in [.

Details about simplification vs preservation

             Simplification        Preservation
vector       x[[1]]                x[1]
list         x[[1]]                x[1]
factor       x[ind, drop = TRUE]   x[1]
matrix       x[1, ] or x[, 1]      x[ind, , drop = FALSE] or x[, ind, drop = FALSE]
data frame   x[, 1] or x[[1]]      x[, 1, drop = FALSE] or x[1]

here ind is an indexing vector of positive integers and naturally arrays behave the “same” as matrices.

What does simplification mean for atomic vectors?

Simplification for atomic vectors concerns the loss of names.

V1 <- c(a=1, b=2, c=3)


V1[1]
a
1
V1[[1]]
[1] 1

What does simplification mean for lists?

Simplification for lists concerns if the result has to be a list or can be of the type of the extracted object.

L1 <- list(a = 1, b = 2:3, c = "a")


str(L1[1])
List of 1
$ a: num 1
str(L1[[1]])
num 1

What does simplification mean for factors?

Simplification for factors means that unused levels are dropped.

F1 <- factor(c("a", "b", "a"),
levels = c("a","b","c"))
F1
[1] a b a
Levels: a b c
F1[1]
[1] a
Levels: a b c
F1[1, drop = TRUE]
[1] a
Levels: a
droplevels(F1)
[1] a b a
Levels: a b

What does simplification mean for matrices?

Simplification for matrices concerns the loss of a dimension.

M1 <- matrix(1:6, nrow=3)


M1[, 1 , drop = FALSE]
[,1]
[1,] 1
[2,] 2
[3,] 3
M1[, 1]
[1] 1 2 3

What does simplification mean for arrays?

Simplification for arrays likewise concerns the loss of a dimension.

A1 <- array(1:12, dim = c(2, 3, 2))


A1[ , , 1, drop = FALSE]
, , 1

[,1] [,2] [,3]


[1,] 1 3 5
[2,] 2 4 6
dim(A1[ , , 1, drop = FALSE])
[1] 2 3 1
A1[ , , 1]
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
dim(A1[ , , 1])
[1] 2 3

What does simplification mean for data frames?

Simplification for data frames means single columns are returned as vectors and not as data frames.

DF1 <- data.frame(a = 1:2, b = letters[1:2])
str(DF1[1])
'data.frame': 2 obs. of 1 variable:
$ a: int 1 2
str(DF1[[1]])
int [1:2] 1 2
str(DF1[ , "a", drop=FALSE])
'data.frame': 2 obs. of 1 variable:
$ a: int 1 2
str(DF1[ , "a"])
int [1:2] 1 2

More on $

Basically x$n is equivalent to x[["n", exact = FALSE]].


It is usually used to extract variables from a data frame.
Note that it cannot be used to extract variables using stored variable names:

DF1 <- data.frame(a = 1:2, b = letters[1:2])


name.a <- "a"
DF1$name.a
NULL
DF1[[name.a]]
[1] 1 2

More on $ II

Another difference between $ and [[ is that $ does partial matching.

DF1 <- data.frame(aaa=1:2, bbb=letters[1:2])


DF1$a
[1] 1 2
DF1[["a"]]
NULL

Subsetting and assignment

All subsetting operators can be combined with assigning values to the selected parts.

x <- 1:6
x
[1] 1 2 3 4 5 6
x[1] <- 20
x
[1] 20 2 3 4 5 6
x[-1] <- 51:55
x
[1] 20 51 52 53 54 55
x[c(1,1)] <- c(-10,-20)

x
[1] -20 51 52 53 54 55
## Logical indexing and NA can be combined!
## The logical index is recycled:
x[c(TRUE,FALSE,NA)] <- 1
x
[1] 1 51 52 1 54 55

Data storage

Approximate storage of numbers

• While it is possible for a computer to store numbers exactly, it is more common to use approximate
representations.
• R uses double precision floating point numbers for its numeric computations.

• E.g., 123.45 is a decimal floating point number everyone understands to be the same as:
$123.45 = 1 \cdot 10^2 + 2 \cdot 10^1 + 3 \cdot 10^0 + 4 \cdot 10^{-1} + 5 \cdot 10^{-2}$.
• One can also write this as $123.45 = 12345 \cdot 10^{-2} = 1.2345 \cdot 10^2$ (the last is the normalized form).
• The sequence of (here, decimal) digits 12345 is called the significand (or mantissa), the 2 is the exponent
(or characteristic) of the number.
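A classic consequence of this approximate storage can be checked directly in the console:

0.1 + 0.2 == 0.3            # both sides are stored only approximately
[1] FALSE
all.equal(0.1 + 0.2, 0.3)   # comparison up to a small tolerance
[1] TRUE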

Floating point number systems

• A floating point number system is characterized by four integers: $b$ (base or radix), $p$ (precision), and
$e_{\min}$ and $e_{\max}$ (minimal and maximal exponents).

• It consists of numbers of the form:

$$x = \pm \left( \delta_0 + \frac{\delta_1}{b^1} + \ldots + \frac{\delta_{p-1}}{b^{p-1}} \right) b^e$$

where $e_{\min} \le e \le e_{\max}$ and $\delta_i \in \{0, \ldots, b-1\}$. The number is normalized if $\delta_0 = 1$.

• In the binary system, $b = 2$ and the digits are $\delta_i \in \{0, 1\}$.

IEEE 754 (I)

• Clearly, all floating point numbers can be represented by the triple (sign, exponent, significand).
• IEEE 754 is a standard for base 2 which says: for double precision, use 64 bits (8 bytes) overall, split
as sign: 1 bit, exponent: 11 bits, significand: 52 bits.

• In principle, the exponent is represented using the biased scheme.

– Note: in this scheme, for $k = 11$ bits, the bit pattern $\beta_{k-1}\beta_{k-2}\ldots\beta_0$ corresponds to
$e = \sum_{i=0}^{k-1} \beta_i 2^i - (2^{10} - 1)$.
– So the exponent range would be $-1023, -1022, \ldots, 1023, 1024$.

• But the smallest (all zeros) and largest (all ones) exponents are special!

• Representing binary floating point numbers in IEEE 754 works as follows:

(a) Exponent neither all 0 bits nor all 1 bits: this is the normalized number
$$\pm \left( 1 + \frac{\delta_1}{2} + \ldots + \frac{\delta_{52}}{2^{52}} \right) 2^e$$
(b) Exponent all 0 bits: this is the de-normalized number
$$\pm \left( 0 + \frac{\delta_1}{2} + \ldots + \frac{\delta_{52}}{2^{52}} \right) 2^{-1022}$$
(c) Exponent all 1 bits: if all bits in the significand are 0, this is $\pm\infty$; otherwise, it is a NaN.

IEEE 754: Examples


• The standard layout for the double precision representation is
$$\sigma \quad \epsilon_{10}\,\epsilon_9 \ldots \epsilon_0 \quad \delta_1 \ldots \delta_{52}$$

• Question: Which IEEE 754 floating point number does the following correspond to?
$$\sigma \quad 11\ldots1 \quad 0\ldots0$$

• Answer: all exponent bits are 1 and all significand bits are zero, so (c) on the previous slide applies: the
number is $\pm\infty$.
• Question: Which IEEE 754 floating point number does the following correspond to?
$$\sigma \quad 00\ldots0 \quad 0\ldots0$$

• Answer: all exponent bits are 0, so this is a denormalized number (see (b) on the previous slide) which has
$\delta_0 = 0$ and
$$\pm \left( 0 + \frac{0}{2} + \ldots + \frac{0}{2^{52}} \right) 2^{-1022} = 0$$
• Note: This is how we get two zeros (because of the sign bit).
• Question: what is the smallest positive normalized number we can get?
• Answer:
– the exponent should be as small as possible: 000...001 (all zeros does not work, as the number
is normalized).
– the significand should be as small as possible: 000...000.

$$\left( 1 + \frac{0}{2} + \ldots + \frac{0}{2^{52}} \right) 2^{-1022} = 2^{-1022}$$

• Question: what is the largest positive denormalized number we can get?

• Answer:
– the exponent should be: 000...000 (as the number is denormalized).
– the significand should be as large as possible: 111...111.

$$\left( 0 + \frac{1}{2} + \ldots + \frac{1}{2^{52}} \right) 2^{-1022} = \sum_{i=1}^{52} 2^{-i} \cdot 2^{-1022} = 2^{-1022} (1 - 2^{-52})$$

Rounding effects

• The maximal precision we can expect for floating point computations in R is about 16 significant decimal
digits (52 binary digits).
• So the basic rule $1 + x = 1 \Rightarrow x = 0$ does not hold in floating point arithmetic! ($1 + 2^{-52}$ is the
smallest representable number greater than 1).

x <- 2^(-52)
1 + x == 1

[1] FALSE

x <- 2^(-53)
1 + x == 1

[1] TRUE
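These machine constants can also be queried directly via the built-in list .Machine:

.Machine$double.eps == 2^(-52)     # the machine epsilon
[1] TRUE
.Machine$double.xmin == 2^(-1022)  # smallest positive normalized double
[1] TRUE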

Rounding effects: Example 1

• Consider 5/4 and 4/5. In decimal notation these can be exactly represented as 1.25 and 0.8.
• In binary notation:

– 5/4 = 1.01 can be exactly represented.


– 4/5 = 0.110011001100... cannot be exactly represented. Some rounding error will occur in the
storage.

• For example we know that $5/4 \cdot (n \cdot 4/5) = n$. But in R...

n <- 1:10
1.25 * (n * 0.8) == n

[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE

• To avoid issues:

all.equal(1.25 * (n * 0.8), n)

[1] TRUE

Rounding effects: Example 2

• Rounding errors tend to accumulate so a long series of calculations will result in larger errors than a
shorter one.

# Three ways of computing the variance:


x <- 1:11
mean(x)

[1] 6

var(x) # built in

[1] 11

sum((x - mean(x))^2)/10

[1] 11

(sum(x^2) - 11 * mean(x)^2) /10

[1] 11

Assume we add a large value to x:

# Three ways of computing the variance:


x <- 1:11 + 10^10
var(x) # built in

[1] 11

sum((x - mean(x))^2)/10

[1] 11

(sum(x^2) - 11 * mean(x)^2) /10 # Oh No...

[1] -13107

Integer storage

• In R, k = 32 bits (4 bytes) are used for integers.


• For general $k$, there are $2^k$ such sequences.
• Which numbers should these bit sequences correspond to?
• Obvious idea: the numbers with binary representation given by the respective bit sequences. I.e., for
k = 3 the following would give all 8 integers between 0 and 7.

000: 0 * 2^2 + 0 * 2^1 + 0 * 2^0 = 0
001: 0 * 2^2 + 0 * 2^1 + 1 * 2^0 = 1
...
111: 1 * 2^2 + 1 * 2^1 + 1 * 2^0 = 7

• But what about negative numbers?


• Solutions: sign and magnitude, bias, two’s complement
• R uses k = 32 bits two’s complement with one modification: 10...0 ⇐⇒ NA_integer_ (the integer
missing value).

• So the $2^{32} = 4294967296$ bit sequences have one zero, one NA, and $(2^{32} - 2)/2 = 2^{31} - 1 = 2147483647$
positive and negative integers each.
• The smallest such integer is $-(2^{31} - 1)$, the largest is $2^{31} - 1$.

as.integer(2 ^ 31 - 1) # works

[1] 2147483647

as.integer(2 ^ 31) # not anymore ...

Warning: NAs introduced by coercion to integer range

[1] NA
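The integer limit can also be queried from .Machine, and exceeding it in integer arithmetic gives NA with a
warning:

.Machine$integer.max
[1] 2147483647
.Machine$integer.max + 1L
Warning in .Machine$integer.max + 1L: NAs produced by integer overflow
[1] NA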

Flow control

Flow control

• Many problems are often of a repetitive nature and solutions are not obtained in a single instance but
certain steps need to be repeated.
• For example

– In simulations usually certain procedures need to be repeated a fixed number of times.


– In algorithms the steps need to be repeated until some convergence criterion is reached.

• For this flow control R offers different constructs which we will introduce in the following slides.

for loop

• The for() statement in R specifies that certain statements are to be repeated a fixed number of times.
• The syntax looks like:

for (index in vector) {


statements
}

• This means that the variable index runs through all elements of vector. For each value in vector the statements are then executed.

• If a result that should be stored is created for each value, then it is recommended to first create an object of the appropriate length in which to store the results.

Fibonacci numbers

To compute in R the first 10 Fibonacci numbers we can use a for loop in the following way:

Fib <- numeric(10) ## create a vector which will store numeric elements
Fib[1] <- 1
Fib[2] <- 1

for (i in 3:10) {
Fib[i] <- Fib[i-1] + Fib[i-2]
}
Fib

[1] 1 1 2 3 5 8 13 21 34 55

if statement
• The if statement can be used to control whether and when certain statements are to be executed.
• There are two versions:

if (condition) {
statements when condition is TRUE
}

or

if (condition){
statements when TRUE
} else {
statements when FALSE
}

if else example

x <- 3
if (x < 5) {
print("'x' is smaller than 5")
} else {
print("'x' is at least 5")
}

[1] "'x' is smaller than 5"

while loop
• The while loop can be used when statements have to be repeated but it is not known in advance how often exactly. The computations should be continued as long as a condition is fulfilled.
• The syntax looks like:

while (condition) {
statements
}

• Hence here condition is evaluated and if FALSE nothing will be done. If the condition is however TRUE,
then the statements are executed. After the statements are executed, the condition is again evaluated.

Fibonacci numbers II

To compute for example all Fibonacci numbers smaller than 100 we could use

Fib1 <- 1
Fib2 <- 1
Fibs <- c(Fib1)
while (Fib2 < 100) {
Fibs <- c(Fibs, Fib2)
oldFib2 <- Fib2
Fib2 <- Fib1 + Fib2
Fib1 <- oldFib2
}
Fibs

[1] 1 1 2 3 5 8 13 21 34 55 89

Note: increasing the length of a vector can be costly for R! Avoid if possible.

repeat loop

• If a loop is needed which does not go through a prespecified number of iterations or should not have
a condition check at the top the repeat loop can be used.
• The syntax looks like:

repeat {
statements
}

• This causes the statement to be repeated endlessly. Therefore a terminator called break needs to be
included. It is usually included as:

if (condition) break

break and next

• In general the break command can be used in any loop and it causes the loop to terminate immediately.
• Similarly, the command next can also be used in any loop; it causes the computations of the current iteration to be terminated immediately and the next iteration to be started from the top (see the sketch below).

• The repeat loop and the functions break and next are rarely used since it is much easier to read and
understand programs using the other looping methods.
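
A minimal sketch of next (the values are chosen purely for illustration): sum only the odd numbers between 1 and 10 by skipping the even ones.

total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers immediately
  total <- total + i
}
total

[1] 25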

Fibonacci numbers III

To compute for example all Fibonacci numbers smaller than 100 we could use also

Fib1r <- 1
Fib2r <- 1
Fibsr <- c(Fib1r)

repeat {
Fibsr <- c(Fibsr, Fib2r)
oldFib2r <- Fib2r
Fib2r <- Fib1r + Fib2r
Fib1r <- oldFib2r
if (Fib2r > 100) break
}
Fibsr

[1] 1 1 2 3 5 8 13 21 34 55 89

switch

• Another possibility for conditional execution is the function switch. It is especially useful when there are more than two possibilities or if the options are named.
• The basic syntax is

switch(EXPR, options)

where EXPR can be an integer value which says which option should be chosen, alternatively it can be a
character string if the options are named.

switch examples I

R1 <- switch(1, a = 11, b = 12, cc = 13, d = 14)


R1

[1] 11

R2 <- switch("b", a = 11, b = 12, cc = 13, d = 14)


R2

[1] 12

R3 <- switch("c", a = 11, b = 12, cc = 13, d = 14)


R3

NULL

switch examples II

SUM <- function(x, type = "L2") {
switch(type,
L2 = {LOC <- mean(x)
SCA <- sd(x)},
L1 = {LOC <- median(x)
SCA <- mad(x)}
)
return(data.frame(LOC = LOC, SCA = SCA))
}
set.seed(1); x <- rnorm(100)
SUM(x)

LOC SCA
1 0.1089 0.8982

SUM(x, type = "L1")

LOC SCA
1 0.1139 0.87

Conditional element selection

• A function not directly connected to the previous flow control but still useful is ifelse.
• The basic syntax is

ifelse(EXPR, yes, no)

• This function is usually used when EXPR is a vector. The result is a vector of the same length as EXPR whose entries take the corresponding value of yes where EXPR is TRUE and of no where EXPR is FALSE. Missing values in EXPR remain missing values.
• Note that ifelse will try to coerce EXPR to logical if it is not. Also the attributes from EXPR will be
kept and only the entries replaced.

ifelse example

ifelse(1:4 < 2.5, "yes", "no")

[1] "yes" "yes" "no" "no"

currency <- factor(c("dollar","euro","euro","dollar"))


amount <- rep(100, 4)
amountEuro <- ifelse(currency=="dollar",
amount * 0.85, amount)
amountEuro

[1] 85 100 100 85

R functions

About objects and functions

As John Chambers (creator of S) put it:

Everything that exists is an object.


Everything that happens is a function call.

Functions in R

• Functions are fundamental building blocks in R and are self-contained units of code with a well-defined purpose.
• To create a function function() is used. The parentheses enclose the argument list. Then a single statement or multiple statements enclosed by {} are specified.
• When R executes a function definition it produces an object with three parts:

1. body: the code inside the function.


2. formals: the list of arguments which specify how to call the function.
3. environment: a guide to where variables of the function are located.

When printing the function it will display these parts. (If the environment is not shown it is the global
environment)

Components of functions: Example I

To reduce the burden for the user, one can give default values to some arguments:

f <- function(x, y = 1) {
z <- x + y
2 * z
}
f

function(x, y = 1) {
z <- x + y
2 * z
}

Components of functions: Example I (continued)

formals(f)

$x

$y
[1] 1

body(f)

{
z <- x + y
2 * z
}

environment(f)

<environment: R_GlobalEnv>

Primitive functions

• There is one exception: a group of functions which do not have the three parts just described. These are called primitive functions.
• All primitive functions are located in the base package. They call C code directly and do not contain any R code.

sum
function (..., na.rm = FALSE) .Primitive("sum")
formals(sum)
NULL
body(sum)
NULL
environment(sum)
NULL

Every operation in R is a function call

• Really, every operation in R is a function call.


• So also +, -, *, [, $, {, for. . . are functions.

• To demonstrate how some operators are actually functions check the following code:

x <- 10
y <- 20

x + y
[1] 30

'+'(x, y)
[1] 30

Scope of variables

• The scope of a variable tells us where the variable would be recognized.


• E.g. Variables defined within functions have local scope and are only recognized within the function.

• In R scope is controlled by the environment of the functions.

– Variables defined in console have global scope.


– Variables defined in functions are visible in the function and in functions defined within it.

• Using local variables instead of global ones is less prone to bugs.


• Also packages in R have their own environment (known as namespace)

Scope of variables: Example 1

f <- function(x, y = 1) {
z <- x + y
2 * z
}
z

[1] 2 2 2 1 1 1 1 1
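
Note that the z printed above is a global object created earlier in the document (otherwise calling z would give an error); the z created inside f is local and is not visible outside the function. A minimal sketch of this behaviour (the names x and g are chosen just for illustration):

x <- 1
g <- function() {
  x <- 100   # local x, only visible inside g
  x
}
g()

[1] 100

x   # the global x is unchanged

[1] 1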

Lazy evaluation

• In the standard case, R arguments are lazy - they are only evaluated when they are actually used.
• To force an evaluation you have to use the function force.
• This also allows us to specify default values in the header of the function for variables which are created
locally.

Lazy evaluation examples

f1 <- function(x) 10
f2 <- function(x) {
force(x)
10
}
f1(stop("You made an error!"))
[1] 10
f2(stop("You made an error!"))
Error in force(x): You made an error!

Calling functions

There are different ways to call functions:

1. Named argument call: Arguments are matched by exact names.


2. Partially named argument call: Arguments are matched using the shortest unique string.
3. Positioning argument call: using the position of the arguments in the function definition.

The three different ways can also be mixed in a function call.


R then uses exact name matching first, then partial name matching and finally positional matching.

Calling functions examples

f <- function(Position1, Pos2, Pos3) {


list(pos1 = Position1, pos2 = Pos2, pos3 = Pos3)
}

str(f(Position1 = 1, Pos2 = 2, Pos3 = 3))


List of 3
$ pos1: num 1
$ pos2: num 2
$ pos3: num 3

str(f(Pos2 = 2, Position1 = 1, Pos3 =3))


List of 3
$ pos1: num 1
$ pos2: num 2
$ pos3: num 3

str(f(1, 2, 3))
List of 3
$ pos1: num 1
$ pos2: num 2
$ pos3: num 3

str(f(2, 3, Position1 = 1))


List of 3
$ pos1: num 1
$ pos2: num 2
$ pos3: num 3

str(f(2, Posi = 1, 3))


List of 3
$ pos1: num 1
$ pos2: num 2
$ pos3: num 3


str(f(1, Pos = 2, 3))


Error in f(1, Pos = 2, 3): argument 2 matches multiple formal arguments

Functions returns

• As a rule, functions can in general return only one object. This is however not a real restriction, as all the desired output can be collected into a list.
• The last expression evaluated in a function is by default the returned object.

• Whenever the function return(object) is called within a function, the function is terminated and
object is returned.

Functions returns example

f1 <- function(x) {
if (x < 0) return("not positive")
if (x < 5) {
"between 0 and 5"
} else {
"larger than 5"
}
}
f1(-1)
[1] "not positive"
f1(1)
[1] "between 0 and 5"
f1(10)
[1] "larger than 5"

Invisible return

Using the invisible function, it is possible to return objects from a function call which are not printed by default.
Invisible output can be assigned to an object and/or forced to be printed by putting the function call in round parentheses.

f1 <- function() 1
f2 <- function() invisible(1)

f1()
[1] 1
f2()

resf2 <- f2()


resf2
[1] 1
(f2())
[1] 1

The pipe operator

• The magrittr package defines the pipe operator %>% and many other packages also make use of it.
• Rather than typing f(x, y) we type x %>% f(y) (start with x then use f(y) to modify it).
• R 4.1.x contains a base R pipe |> with the same syntax:

x <- 1:4; y <- 4
sum(x, y)

[1] 14

x |> sum(y)

[1] 14

x |> mean()

[1] 2.5

Basic statistics in R

R for descriptive statistics

• The following slides give a first vocabulary for doing basic statistics in R and for formulating statistical models in R.
• The usage of those functions will be demonstrated using the crabs dataset from the package MASS.

The crabs dataset

The crabs dataset of the package MASS contains 8 variables measured on 200 crabs. The variables are:

• sp for species. Factor with levels B (blue) and O (orange).


• sex, factor with two levels giving the sex of the crab.
• index, integer values between 1 and 50. Will be ignored here (related to the study design).
• FL, frontal lobe size (mm).
• RW, rear width (mm).
• CL, carapace length (mm).
• CW, carapace width (mm).
• BD, body depth (mm).

Starting with descriptive statistics

In order to get the data ready to analyze them we do:

> library(MASS)
> data(crabs)
> # ?crabs would show the help file for the dataset
> str(crabs)
'data.frame': 200 obs. of 8 variables:
$ sp : Factor w/ 2 levels "B","O": 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
$ index: int 1 2 3 4 5 6 7 8 9 10 ...
$ FL : num 8.1 8.8 9.2 9.6 9.8 10.8 11.1 11.6 11.8 11.8 ...

$ RW : num 6.7 7.7 7.8 7.9 8 9 9.9 9.1 9.6 10.5 ...
$ CL : num 16.1 18.1 19 20.1 20.3 23 23.8 24.5 24.2 25.2 ...
$ CW : num 19 20.8 22.4 23.1 23 26.5 27.1 28.4 27.8 29.3 ...
$ BD : num 7 7.4 7.7 8.2 8.2 9.8 9.8 10.4 9.7 10.3 ...

Descriptive measures for numeric data

• The classical summary statistics for numeric data are the mean, the standard deviation or variance,
correlation and covariance matrix. Other measures are the median and quantiles as well as the extreme
values.
• A good overview is provided in R using summary.
• The mean and median can be also obtained using functions of the same name.
• The functions for variance and standard deviation have also the obvious function names var and sd.
• Quantiles can be calculated using the quantile function.

Crabs (numeric) summary

> summary(crabs)
sp sex index FL RW CL
B:100 F:100 Min. : 1.0 Min. : 7.2 Min. : 6.5 Min. :14.7
O:100 M:100 1st Qu.:13.0 1st Qu.:12.9 1st Qu.:11.0 1st Qu.:27.3
Median :25.5 Median :15.6 Median :12.8 Median :32.1
Mean :25.5 Mean :15.6 Mean :12.7 Mean :32.1
3rd Qu.:38.0 3rd Qu.:18.1 3rd Qu.:14.3 3rd Qu.:37.2
Max. :50.0 Max. :23.1 Max. :20.2 Max. :47.6
CW BD
Min. :17.1 Min. : 6.1
1st Qu.:31.5 1st Qu.:11.4
Median :36.8 Median :13.9
Mean :36.4 Mean :14.0
3rd Qu.:42.0 3rd Qu.:16.6
Max. :54.6 Max. :21.6

Crabs (numeric) descriptive statistics I

> mean(crabs$RW) # the mean


[1] 12.74
> median(crabs$RW) # the median
[1] 12.8
> var(crabs$RW) # the variance
[1] 6.622
> sd(crabs$RW) # the standard deviation
[1] 2.573
> quantile(crabs$RW) # the default quantiles
0% 25% 50% 75% 100%
6.5 11.0 12.8 14.3 20.2
> quantile(crabs$RW, seq(0, 1, 0.2)) # "my" quantiles

0% 20% 40% 60% 80% 100%
6.50 10.68 11.96 13.50 14.82 20.20

Crabs (numeric) descriptive statistics II


For the covariance matrix and the correlation matrix only the numeric variables should be selected first.

> crabs.numeric <- crabs[,-(1:3)]


> round(cov(crabs.numeric), 2)
FL RW CL CW BD
FL 12.22 8.16 24.36 26.55 11.82
RW 8.16 6.62 16.35 18.24 7.84
CL 24.36 16.35 50.68 55.76 23.97
CW 26.55 18.24 55.76 61.97 26.09
BD 11.82 7.84 23.97 26.09 11.73
> round(cor(crabs.numeric), 2)
FL RW CL CW BD
FL 1.00 0.91 0.98 0.96 0.99
RW 0.91 1.00 0.89 0.90 0.89
CL 0.98 0.89 1.00 1.00 0.98
CW 0.96 0.90 1.00 1.00 0.97
BD 0.99 0.89 0.98 0.97 1.00

Descriptive measures for categorical data


• Categorical data is usually displayed in contingency tables. The user can choose whether to show the absolute frequencies or the relative ones.
• The table with absolute values is created using the function table.
• Applying the function prop.table on such a table computes the relative cell counts.

Tables for the crabs data

> table(crabs$sex)

F M
100 100
> tab <- table(crabs$sex, crabs$sp)
> tab

B O
F 50 50
M 50 50
> prop.table(tab) # total percentages

B O
F 0.25 0.25
M 0.25 0.25
> prop.table(tab, 1) # row percentages

B O
F 0.5 0.5
M 0.5 0.5
> prop.table(tab, 2) # column percentages

B O
F 0.5 0.5
M 0.5 0.5

The functions apply, tapply and sapply


• Often one wants a descriptive statistic for several variables or separately for different groups.
• R provides for this the functions apply, tapply and sapply.
• apply
Applies a function for each row or column of a data frame.
• tapply
Applies a function to usually a vector for each unique level-combination of an indicator.
• sapply
A generalization of apply especially for lists.

The functions apply, tapply and sapply on the crab data

> apply(crabs.numeric, 2, median)


FL RW CL CW BD
15.55 12.80 32.10 36.80 13.90
> tapply(crabs$FL, list(crabs$sp, crabs$sex), median)
F M
B 13.15 15.1
O 18.00 16.7
> sapply(crabs.numeric, median)
FL RW CL CW BD
15.55 12.80 32.10 36.80 13.90

The function by
• The function by is a very nice wrapper of the function tapply when using data frames. It applies the intended function to all variables of the data set for each unique level of an indicator variable.
• Probably the easiest way to get a nice groupwise summary for a data frame. Note however that the function
must work on data frames!

> by(crabs.numeric, list(crabs$sp, crabs$sex), colMeans)


: B
: F
FL RW CL CW BD
13.27 12.14 28.10 32.62 11.82
------------------------------------------------------------
: O
: F
FL RW CL CW BD
17.59 14.84 34.62 39.04 15.63
------------------------------------------------------------
: B
: M
FL RW CL CW BD
14.84 11.72 32.01 36.81 13.35
------------------------------------------------------------
: O

: M
FL RW CL CW BD
16.63 12.26 33.69 37.19 15.32

The function by on the crab data II


If there were no function like colMeans, then a wrapper would have to be written:

> colSd <- function(x, ...) sapply(x, sd, ...)


> by(crabs.numeric, list(crabs$sex), colSd)
: F
FL RW CL CW BD
3.538 2.741 6.703 7.381 3.343
------------------------------------------------------------
: M
FL RW CL CW BD
3.463 2.161 7.471 8.330 3.494

The function with


Sometimes it is easier to use the function with which is used as follows

with(DATA, function(var.name,...))

For example:

> with(crabs, table(sex, sp))


sp
sex B O
F 50 50
M 50 50

The function aggregate


• Similar to using by also the function aggregate can be used to summarize data. It is often more convenient.
• For example the median of the RW variable of the crab data can be obtained for each species and sex as

> with(crabs, aggregate(cbind(RW, FL, CL),


+ list(sp = sp, sex = sex), median))
sp sex RW FL CL
1 B F 12.20 13.15 27.90
2 O F 14.65 18.00 34.70
3 B M 11.70 15.10 32.45
4 O M 12.10 16.70 33.35

Statistical models in R
Summary statistics give only a glimpse at the data and often inference and/or modeling is the actual goal of the analysis. R provides a lot of statistical tests as well as a lot of modeling functions. Before we can use them, however, we have to learn something about R's formula definitions to be able to define models in R.
A basic formula in R has the form

y ~ x1 + x2 + x3

where the part left of ~ is the dependent variable and the right part defines the independent variables.

Formulae and intercept
The intercept in a model formula is represented by a 1. By default R assumes that an intercept is present, therefore
mentioning the intercept or not makes no difference. If however the intercept should be removed a -1 is needed in
the formula.
These two models are equivalent, both have an intercept:
y ~ x1 + x2 and y ~ x1 + x2 + 1
The same model without intercept must be defined as:
y ~ x1 + x2 - 1

Interactions and nested designs


Often in statistical models interactions between variables are suspected or variables are nested. This can be formulated
also using R formulae. Several special operators are available for this. To name a few:

• :
Used for interactions like x1 : x2
• *
Main effects plus interactions, like x1 * x2 = x1 + x2 + x1 : x2.
• ^
Factor crossing up to a certain degree, like (x1 + x2 + x3)^2 = x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3.
• -
Removing terms, like (x1 + x2 + x3)^2 - x2:x3 = x1 + x2 + x3 + x1:x2 + x1:x3.

Variable transformations in formulae


Common practice is to use transformations of variables in statistical models. This can be done in R directly in the
model formula. For example:
log(y) ~ x1 + x2 + sin(x3) is a correct formula.
However, due to the definition of interactions and so on, the special function I is of interest here. This function
interprets the operators used inside it as expressions in their original meaning. For example:

• y ~ I(x1 - 1) subtracts one unit from x1 before it enters the model; it does not remove the intercept. This is therefore different from y ~ x1 - 1.
• y ~ I(x1^2) squares variable x1 and has nothing to do with factor crossing.
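
As a small illustration of how such formulae are used in practice (model fitting with lm() is treated in detail in the regression chapter), using the crabs data loaded above:

> m1 <- lm(FL ~ CL * sex, data = crabs)     # main effects plus interaction
> m2 <- lm(FL ~ CL + I(CL^2), data = crabs) # quadratic term via I()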

Basic graphs with R


• One of the major reasons for the popularity of R is its powerful resources for producing high quality graphs.
• R can be used to produce graphs in all kinds of formats like .ps, .eps, .jpg, .png, .pdf, . . .
• R has several graphic systems to produce graphs which follow different philosophies. We will shortly introduce
the different systems but then focus on the so-called traditional graphic system.
• The basic idea when creating a graph in R is that in the interactive mode R loads a (default) graphic device and its drivers. When the first plotting command is executed, the driver opens a new window which will be used for the graph. All future plots will be displayed in that window unless a new device is opened.

Different graphic systems in R
The different graphic systems in R are:

1. Traditional graphic system


The most basic one which has functions for almost all common plots. The defaults are already pretty reasonable.
2. Trellis plots
Trellis plots are provided via the package lattice and follow the design ideas of Bill Cleveland.
3. Grammar of Graphics plots
The package ggplot (nowadays better: ggplot2) implements graphs based on the grammar of graphics ideas.
Some see it as a combination of Traditional Graphics and Trellis plots.
4. Grid
The package grid provides tools to create graphic scenes. Though the normal user won’t really need it since
this produces plots basically from scratch. But it can also be used to add output to graphs made with lattice.

What graphs are possible


Almost all graphs are possible with R, basically because everything could be produced from scratch using grid. However, for almost all plots the "normal" user needs, ready-made functions are available. To get an overview of what is possible one should check:

> demo("graphics")
> library(lattice)
> demo("lattice")

or for ggplot check http://ggplot2.tidyverse.org/reference/.


Also very interesting in this context is the R graph gallery https://www.r-graph-gallery.com/ and the package vcd
for visualizing categorical data.

Command types for traditional graphics


From now on we will consider only traditional graphic commands. Those functions can be divided into three groups:

• High-level plotting functions create a new plot in the graphics window.


• Low-level plotting functions add information to an existing plot.
• Interactive graphics functions add or extract information to / from an existing plot using devices such as a mouse (not shown here; if interested check out locator(), identify()).

High-level plotting functions


• High-level plotting commands create a complete plot for the data passed to them. They always create a new plot and overwrite, if necessary, the previous one. High-level plotting commands are generic functions which produce different types of plots depending on the class of the data to plot.
• High-level plotting commands also add, if appropriate, default axes, labels and so on to the plot unless requested otherwise. One of the most frequent high-level plotting commands is the command plot.

The function plot I


The main plotting function is plot. It has methods for most classes. To get an overview see methods(plot). The
most usual applications are:

• plot(x,y)
produces a scatter plot if x and y are numeric.

• plot(X)
produces a scatter plot matrix if X is a data frame.
• plot(x) produces a scatter plot of x against its index vector if x is numeric.
• plot(x)
produces a bar plot if x is a factor.
• plot(x,y)
produces a spine plot if x and y are factors.

The function plot II


• plot(x ~ y)
produces a boxplot of x for each level of y if x is numeric and y is a factor.
• plot(x ~ y1 + y2 + y3)
produces a series of plots where x is plotted against each term of the right side of the formula separately. The
type of the plot depends thus on the type of arguments.
• plot(x)
produces several regression diagnostic plots if x is an lm-object (a linear model object).
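
For example, with the crabs data from the descriptive statistics section (each command creates a new plot):

> plot(crabs$CL, crabs$CW)    # scatter plot of two numeric vectors
> plot(crabs$sp)              # bar plot of a factor
> plot(CW ~ sp, data = crabs) # boxplots of CW for each species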

More high-level functions I


Other high level plotting functions are:

• pairs(X)
produces a scatter plot matrix if X is a matrix or data frame.
• coplot(x1 ~ x2 | x3)
produces a number of scatterplots of x1 against x2, given values of x3 (in case x3 is a factor it produces a
scatter plot for each factor level)
• matplot(X,Y)
plots the columns of the matrix X against the columns of matrix Y.

More high-level functions II


For representing 3 dimensional data:

• image(x,y,z)
plots a grid of rectangles along the ascending x, y values and fills them with different colours to represent the
values of z.
• contour(x,y,z)
draws a contour plot for z.
• persp(x,y,z)
draws a 3D surface for z.

Special statistical high-level plotting functions I


• qqnorm(x)
plots the quantiles of x against the one of the normal distribution.
• qqplot(x,y)
plots the quantiles of x against the quantiles of y.
• stem(x)
plots a stem and leaf plot.
• hist(x)
plots a histogram.
• barplot(x)
plots a bar plot.

Special statistical high-level plotting functions II
• dotchart(x)
plots a dotchart.
• stripchart(x)
produces a 1D scatterplot.
• boxplot(x) produces a boxplot.
• pie(x)
produces a pie chart.
• curve(expr)
draws the given expression.

Arguments for high-level plotting commands I


Most of the time R plots already look pretty good by default, though one normally needs to customize some settings. There are a lot of arguments for high-level plotting commands to do so. Here is a selection of some of them. However, they don't work for all functions.

• add=TRUE
forces the function to act like a low-level plotting command, “adds” the plot to an already existing one.
• axes=FALSE
suppresses axis, useful when custom axes are added.
• log="x", "y" or "xy"
Logarithmic transformation of x, y or both axes.
• type=
controls the type of the plot, the default is points.

Arguments for high-level plotting commands II


• xlab="text"
changes x-axis label (default usually object name).
• ylab="text"
changes y-axis label (default usually object name).
• main="text"
adds a title at the top of the plot.
• sub="text"
adds a subtitle just below the x-axis.

Types of plots
The default for the type= argument is "p" which draws an individual point for each observation. Other options for this argument are (see also the sketch after this list):

• "l": lines between the points.


• "b": plots points and connects them with lines.
• "o": points overlaid by lines.
• "h": draws vertical lines to the zero axis.
• "s": step function, top of the vertical defines the point.
• "S": step function, bottom of the vertical defines the point.
• "n": no plotting, plots however the default axes and coordinates according to the data (might be needed in
order to continue with low-level plotting commands).
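
A small sketch of some of these types (the data are purely illustrative; each call creates a new plot):

> x <- 1:10
> plot(x, x^2, type = "b")  # points connected by lines
> plot(x, x^2, type = "h")  # vertical lines to the zero axis
> plot(x, x^2, type = "s")  # step function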

Low-level plotting functions
Sometimes the result from the high-level plotting commands needs to be "improved". This can be done by low-level plotting commands which add additional information (like extra points, lines, a legend, . . . ) to an existing plot.
There are plenty of low-level plotting commands available.
In the following only a few of them will be introduced.

Adding lines and points


The functions points and lines add points or lines to the current plot. The different types can also be specified
using the type= argument.
The function abline is however often more convenient to add straight mathematical lines. It can be used in the
following ways:

• abline(a, b) adds a line with intercept a and slope b.


• abline(h = y) y defines the height of a horizontal line.
• abline(v = x) x defines the x coordinate for a vertical line.
• abline(lm.object) if lm.object is a list of length 2 it adds a line using the first value as intercept and the
second as slope.

Note:
Polygons can be added with the function polygon.
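
A small sketch using the built-in cars data (speed and stopping distance):

> plot(cars)
> abline(h = 40, lty = 2)                        # horizontal line at dist = 40
> abline(v = 15, lty = 3)                        # vertical line at speed = 15
> abline(lm(dist ~ speed, data = cars), col = 2) # fitted regression line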

Adding text
The function text adds text to a plot at specified coordinates. Usage:
text(x,y,labels,...)
This means that the i-th element of labels is put at the position (x_i, y_i).
A common application for this is:
plot(x, y, type = "n")
text(x, y, names)
Note:
Also mathematical symbols and formulae can be added as text, then the labels are rather expressions. For details
see help for plotmath.

Adding a legend
The function legend adds a legend to a specified position in the plot.
Usage:
legend(x,y,legend,...)
In order to let R know what the "connection" to the graph is, at least one of the following options has to be specified (see also the sketch after this list). The specification v must have the same length as legend.

• fill=v colours of filled boxes.


• col=v colours of points or lines.
• lty=v line styles.
• lwd=v line widths.
• pch=v plotting characters.
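
A minimal sketch of adding a legend (the data are just illustrative):

> x <- 1:10
> plot(x, x, type = "b", col = 2, pch = 16)
> lines(x, rev(x), type = "b", col = 4, pch = 17)
> legend("top", legend = c("increasing", "decreasing"),
+        col = c(2, 4), pch = c(16, 17), lty = 1)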

Customizing axes
In R one can add several axes to a plot. The function to use is axis. You can specify for the axis the side, position,
label, tick and so on.
Usage:
axis(side,...)
The side of the plot is defined this way:
1=below, 2=left, 3=above and 4=right.
This function is mainly used when in the high-level plotting function the argument axes was set to FALSE.
Note:
If one wants ticks at an axis of a 1D plot for every observed value the function rug can be used.
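
A small sketch: suppress the default axes in the high-level call and add customized ones afterwards (plus a rug for the observed values):

> x <- rnorm(50)
> plot(x, axes = FALSE)
> axis(1, at = c(1, 25, 50)) # x-axis below, with selected tick positions
> axis(2, las = 1)           # y-axis on the left, horizontal tick labels
> box()                      # draw the surrounding box again
> rug(x, side = 2)           # ticks for every observed value of x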

Titles with low-level commands


The low-level plotting function title can be used to add also titles and subtitles to an existing plot.
The positions will be the same as when using the corresponding arguments of the high-level plotting commands.
Usage:
title(main, sub)

Graphic parameters
Always when a graphic device gets activated, a list of graphical parameters is activated. This list has certain default
settings. Those default settings are often however not satisfying and should be changed.
Changes can be done permanently in order to affect all plotting functions submitted to that device or only for one
plotting function call. With graphical parameters one can change almost every aspect of a graphic. All graphic
parameters have a name. In the following, some of them are introduced.

Permanent changes of the graphical settings for a device


The function par is used to modify and access the graphical parameters permanently for all future calls of plotting
functions.
The changes are always globally, independent from wherever you call it. Submitting only

> par()

gives a list with all graphical parameters and their current settings.

> par(c(parameter_1, parameter_2, ...))

provides only the settings of the parameters given in the vector.


To change those settings specify in the function the parameter and its new value

> par(parameter_1 = value, parameter_2 = value, ...)

Temporary changes of the graphical parameter settings


Almost all graphical parameters can also be specified in the different plotting functions and then affect the settings only for that one call.
For instance:

> plot(x,y, pch="*")

produces a scatter plot with * as plotting character instead of using a point.

Important graphic parameters I


• pch=
specifies the character used for the plotting. The character can be specified directly by submitting it in quotes or indirectly by providing an integer between 0 and 25.
• lty=
specifies the line type with an integer from 1 onwards.
• lwd=
specifies the line width in multiples of the default width.

Important graphic parameters II


• col=
specifies the colour of the symbols, text and so on. For every graphic element exists a list with the possible colours. The value needed here is the index of the colour in the list.
• font=
an integer to specify the font of the text (1=plain, 2=bold, 3=italic, 4=bold and italic). font.axis, font.lab, font.main and font.sub are analogous.
• cex=
specifies the character expansions in times of the default.

Axis and tick marks


Axes in R consist of 3 components: axis line, tick marks and tick labels.
All 3 components can be customized.

• lab=c(x,y,n)
x specifies the number of ticks at the x-axis, y at the y-axis, n the length of the tick labels in characters
(including decimal point).
• las=
orientation of axis labels (0=parallel, 1=horizontal, 2=perpendicular).
• mgp=c(d1,d2,d3)
positions of axis components (details see manual).
• tck=
length of the tick marks.
• xaxs=
style of the x-axis (possible settings, “s”, “e”, “i”, “r”, “d”) y-axis analogous.

Figure margins
A single plot in R is called a figure. A figure contains as well the actual “plotting area” as the surrounding margins.
The borderline between margin and plotting area are normally the axes. The margins contain the labels, titles and
so on.
A graph of the plotting area can be seen on the next slide.
There are two arguments to control the margins. The argument mai sets the margins measured in inches, whereas
the argument mar measures them in number of text lines. The margins themselves are divided into four parts: the
bottom is part 1, left part 2, top part 3 and light part 4. The different parts are addressed with the corresponding
index of the margin vector.
For instance:
mai=c(1,2,3,4) (1 inch bottom, 2 inches left, 3 inches top, 4 inches right)
mar=c(1,2,3,4) (1 line bottom, 2 lines left, 3 lines top, 4 lines right)

Figure regions

Figure 1: Taken from the R Introduction manual.

Multiple figures in one graphic window


In R, it is possible to put several figures into one window. Each figure still has its own plotting area and margins,
but in addition one can add optionally a surrounding overall outer margin.
To do that one has to define an array which sets the size of the multiple figures environment.
The two parameters mfcol and mfrow define such an environment; the only difference is that mfcol fills the array by columns and mfrow by rows.
The next slide shows a multiple figure environment which could have been created using
mfcol=c(3,2) or mfrow=c(3,2).
The outer margins (by default 0) can be set using the oma and omi arguments analogous as the mar and mai arguments.
Text to the outer margins can be added using the mtext function.
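
A minimal sketch (the data are simulated just for illustration):

> par(mfrow = c(1, 2), oma = c(0, 0, 2, 0)) # 1 x 2 layout with an outer top margin
> hist(rnorm(100))
> boxplot(rnorm(100))
> mtext("A common title in the outer margin", outer = TRUE)
> par(mfrow = c(1, 1))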

Multiple figure regions

Figure 2: Taken from the R Introduction manual.

Device drivers
R can create graphics for almost all types of display or printing devices. However, R has to be told before making the figure which device should be used; therefore the device driver has to be specified.
help(Devices)
provides a list with all possible devices. The special device of interest is activated by calling its name and specifying
the necessary options in the parentheses.
For instance:

> jpeg(file="figure.jpg",
+ width=5, height=4, bg="white")

produces a .jpg file.


To finish with a device, one should submit


> dev.off()

Multiple graphic devices


In R several graphic devices can be used at the same time. To start a new device one calls its name. E.g.
windows()
opens a new graphics window when running R under Windows.
The last opened device is always the active one. To reactivate an older window one has to use the function dev.set.
dev.set(1) would for example reactivate the first device. Plotting commands affect only the active device.

Examples for plotting with R


The graphical possibilities of R mentioned so far are only a small part of what is possible. The best way to make optimal use of R's graphical features is to explore them yourself.
However, the next slides will give some examples.
All those plots could still be improved further.

Plotting example I
The following plot should give an impression of the colours, symbols and point sizes in R.

> plot(1,1,xlim=c(1,10),ylim=c(0,5),type="n")
> points(1:9,rep(4.5,9),cex=1:9,col=1:9,pch=0:8)
> text(1:9,rep(3.5,9),labels=paste(0:9),cex=1:9,col=1:9)
> points(1:9,rep(2,9),pch=9:17)
> text((1:9)+0.25,rep(2,9),paste(9:17))
> points(1:8,rep(1,8),pch=18:25)
> text((1:8)+0.25,rep(1,8),paste(18:25))

Plotting example I - the plot

(Figure: the resulting plot, showing the plotting characters 0-25, colours and character expansions.)

Plotting example II
This plot is about putting two figures into one window.

> set.seed(1) # to make it reproducible


> x<-rnorm(80)
> breaks<-(-ceiling(max(abs(x))):ceiling(max(abs(x))))
> ylim=c(0,0.6)
> par(mfrow=c(1,2))
> hist(x,freq=F,breaks=breaks,add=F,ylim=ylim)
> curve(dnorm(x),add=TRUE)
> plot(density(x),type="l",main="Kernel Density of x"
+ ,ylim=ylim)
> hist(x,freq=FALSE,add=TRUE)
> par(mfrow=c(1,1))

Plotting example II - the plot

(Figure: the resulting plot, with a histogram of x plus the normal density curve on the left and a kernel density estimate of x with the histogram overlaid on the right.)

Plotting example III


This example shows parallel boxplots which are made using a model formula. The ticks on the right and left side of the plot indicate the observations.
The data used are the Gas and Insul variables from the whiteside dataset (package MASS), used in an earlier example.

> boxplot(Gas ~ Insul, col = c("grey50","grey80"))


> rug(Gas[Insul == "Before"], side = 2)
> rug(Gas[Insul == "After"], side = 4)

Plotting example III - the plot

(Figure: the resulting parallel boxplots of Gas for Insul = Before and After, with rugs marking the individual observations.)

Basic data handling in R

Workflow
Before we can perform the statistical analysis, steps are required to bring the data into a decent format and to get
it ready for the analysis:

1. Importing the data.


2. Preprocessing: renaming columns, adding transformed variables, sorting the observations, merging multiple datasets etc.
3. (Exporting the transformed data set.)
4. Exploratory analysis: descriptives and graphs.
5. Handling of missing data
6. Outlier analysis and handling.
7. (Checking of assumptions e.g., normality)

Note: Steps 2-7 need not be done in order and can be done repeatedly.

Datasets available in R
Base R and a lot of add-on packages have built-in datasets (i.e., data.frame objects) to demonstrate the usage of functions.
Those datasets can be loaded using the function

> data(foo)

This function searches along the search path for a dataset with the corresponding name.
A list of all datasets currently available can be retrieved by submitting only

> data()

Detailed information for a dataset is given in the dataset's help file.

Datasets not available in R


One's own data is normally stored outside of R in ASCII files, Excel files, databases or file formats of other statistical software packages.
R can read most formats of other software packages using the foreign package (e.g. SAS, SPSS, Stata, S-Plus, Minitab, EpiInfo, . . . ) and can also directly access different DBMSs (database management systems) using different packages. For details on how to import such data see the manual R Data Import / Export.
In this course we will assume that the data is available in ASCII format.

Importing ascii files


R has several functions to read ASCII files which differ in their flexibility and their default settings. These functions are

• scan Most flexible function, all the following functions are based on this function.
• read.table Probably the most user-friendly function to read tabular data; this function will be used in this course.
• read.csv Same as read.table but different default values.
• read.csv2 Same as read.table but different default values.
• read.delim Same as read.table but different default values.
• read.delim2 Same as read.table but different default values.

Reading tabular data


For most datasets which are saved in tabular form as ASCII files, read.table can be used to import the data.
read.table automatically creates a data frame in R and also automatically tries to convert each column of the data into the right format (e.g. numeric, factor, . . . ).
Some useful arguments of read.table:

• sep What is used in the file to separate the columns?


• header Logical. Has the original file in the first row the variable names or not?
• na.strings What is used in the file as symbol for missing values?
• dec What is used in the file as the decimal symbol?

Sometimes read.table makes unfortunately rather strange conversions for the different variables.
In that case the following arguments of read.table are useful:

• as.is Should the function really try to convert the variables to the “right” format?
• colClasses If you know the format of each class in advance, you can also specify them here.
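
A sketch of a typical call (the file name mydata.txt, the separator ";" and the decimal symbol "," are only assumptions about how such a file might look):

> dat <- read.table("mydata.txt", header = TRUE, sep = ";",
+                   dec = ",", na.strings = "NA")
> str(dat)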

Exporting tabular data


If a data frame should be saved as an ASCII file the function write.table can be used. It works basically as the
function read.table and has almost the same arguments.
However if you have a very large data frame you want to export, then the function write.matrix of the package
MASS might be more suitable since it requires much less memory.
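
A sketch of exporting the crabs data used earlier (the file name crabs.txt is just an example; the file is written to the working directory):

> write.table(crabs, file = "crabs.txt", sep = "\t", row.names = FALSE)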

data.table
Especially for large data sets, data frames are not very suitable. The package data.table provides its own infrastructure to deal with, for example, data sets of sizes of 100GB.
The corresponding function to read in the data is then fread.
In the following we will assume however a data frame.

Data preprocessing
• In the research process, doing the statistical analysis often takes less time than data preprocessing.
• Preprocessing in this context means for example transformations of variables, sorting according to variables, combining different data frames or splitting data frames.
• R might not be the most convenient tool for data preprocessing (sorry!). But it still offers a lot of tools and most operations can be done with it.
• This section of the lecture deals with data manipulation for data frames and uses methods that are provided by the base distribution of R, though packages like reshape, for example, make some transformations easier.

The function rbind


If we have two or more data frames which have the same variables and should be combined this can be easily done
using rbind when the order of the variables is the same in both frames.

> dataF1 <- data.frame(V1 = 1:3, V2 = rnorm(3),


+ V3 = factor(c(1, 2, 1)))
> dataF2 <- data.frame(V1 = 4:5, V2 = rnorm(2),
+ V3 = factor(c(2, 2)))
> rbind(dataF1, dataF2)
V1 V2 V3
1 1 -0.5687 1
2 2 -0.1352 2
3 3 1.1781 1
4 4 -1.5236 2
5 5 0.5939 2

The function cbind


If we have two or more data frames which have different variables for the same subjects, these can be combined using cbind when the order of the subjects is the same in both frames.

> dataF1 <- data.frame(V1 = 1:3, V2 = rnorm(3),


+ V3 = factor(c(1, 2, 1)))
> dataF3 <- data.frame(V4 = 4:6, V5 = rnorm(3))
> cbind(dataF1, dataF3)
V1 V2 V3 V4 V5
1 1 0.3330 1 4 0.3700
2 2 1.0631 2 5 0.2671
3 3 -0.3042 1 6 -0.5425

Variable names
For large data sets it is sometimes useful to see the variable names of a data frame. Or sometimes one even wants to
change those names. There are several ways to do this. One way is the function names.

> dataF1 <- data.frame(V1 = 1:3, V2 = rnorm(3),
+ V3 = factor(c(1, 2, 1)))
> names(dataF1) # gets the names
[1] "V1" "V2" "V3"
> names(dataF1) <- c("v1","v2","v3") # overwrites the
> # current names

This could be also done using the function colnames.


One could also give row names, this could be done using row.names (slightly different from rownames).

> row.names(dataF1)
[1] "1" "2" "3"

Note: rownames and colnames are for matrices. row.names and names are for data frames. But both versions can be
used.

Sorting data frames


• If the order of the observations is important, it is sometimes necessary to sort a data frame according to one
or several variables.
• One way to do this is using the function order to create an index vector.

> dataF1 <- data.frame(id = c(1, 11, 6, 17), V2 = 1:4,


+ V3 = factor(c(1, 2, 1, 1)))
> order.id <- order(dataF1$id)
> dataF1[order.id, ]
id V2 V3
1 1 1 1
3 6 3 1
2 11 2 2
4 17 4 1

The function merge


The functions rbind and cbind are easy to use but not very flexible. The most flexible function to combine two
datasets is called merge.

> DF1 <- data.frame(student = c("Jim","Sarah","Mike"),


+ grade = c(1, 4, 2))
> DF2 <- data.frame(student = c("Jim","Sarah","Julia"),
+ grade = c(1, 4, 1),
+ majorS = c("Epi","PH","Epi"))
>
> merge(DF1, DF2, by = c("student","grade"),
+ all = TRUE)
student grade majorS
1 Jim 1 Epi
2 Julia 1 Epi
3 Mike 2 <NA>
4 Sarah 4 PH

The function merge II
• The function merge(x, y, ...) performs the operations known in database management systems (e.g., SQL) as JOIN:

– all = TRUE performs the “outer join”


– all = FALSE (default) performs the “inner join”
– all.x = TRUE, all.y = FALSE performs the “left join”
– all.x = FALSE, all.y = TRUE performs the “right join”
– (by = NULL cross join)

The function merge III

> merge(DF1, DF2, by = c("student","grade"), all = TRUE)


student grade majorS
1 Jim 1 Epi
2 Julia 1 Epi
3 Mike 2 <NA>
4 Sarah 4 PH

> merge(DF1, DF2, by = c("student","grade"))


student grade majorS
1 Jim 1 Epi
2 Sarah 4 PH

> merge(DF1, DF2, by = c("student","grade"), all.x = TRUE)


student grade majorS
1 Jim 1 Epi
2 Mike 2 <NA>
3 Sarah 4 PH

The function merge IV

> merge(DF1, DF2, by = c("student","grade"), all.y = TRUE)


student grade majorS
1 Jim 1 Epi
2 Julia 1 Epi
3 Sarah 4 PH

> merge(DF1, DF2, by = NULL)


student.x grade.x student.y grade.y majorS
1 Jim 1 Jim 1 Epi
2 Sarah 4 Jim 1 Epi
3 Mike 2 Jim 1 Epi
4 Jim 1 Sarah 4 PH
5 Sarah 4 Sarah 4 PH
6 Mike 2 Sarah 4 PH
7 Jim 1 Julia 1 Epi
8 Sarah 4 Julia 1 Epi
9 Mike 2 Julia 1 Epi

The function reshape
• A special case of data is longitudinal data (panel data).
• There are two ways in which you can get the data: with the repeated measurements in individual columns (wide format) or below each other (long format).
• Depending on the analysis you might need one or the other form. The function reshape can change this.

The function reshape: long to wide

> DFlong <- data.frame(id = rep(1:3, c(3, 4, 3)),


+ time = c(1:3, 1:4, 2:4),
+ meas = rnorm(10))
> DFlong
id time meas
1 1 1 1.58683
2 1 2 0.55849
3 1 3 -1.27659
4 2 1 -0.57327
5 2 2 -1.22461
6 2 3 -0.47340
7 2 4 -0.62037
8 3 2 0.04212
9 3 3 -0.91092
10 3 4 0.15803
> reshape(DFlong, timevar = "time", idvar = "id", direction="wide")
id meas.1 meas.2 meas.3 meas.4
1 1 1.5868 0.55849 -1.2766 NA
4 2 -0.5733 -1.22461 -0.4734 -0.6204
8 3 NA 0.04212 -0.9109 0.1580

The function reshape: wide to long

> DFwide <- data.frame(id = 1:3, meas1 = rnorm(3), meas2 = rnorm(3),


+ meas3 = rnorm(3), meas4 = rnorm(3))
> DFwide
id meas1 meas2 meas3 meas4
1 1 -0.6546 0.9102 -0.6357 -0.6507
2 2 1.7673 0.3842 -0.4616 -0.2074
3 3 0.7167 1.6822 1.4323 -0.3928
> reshape(DFwide, varying = list(2:5), direction = "long",
+ timevar = "time")
id time meas1
1.1 1 1 -0.6546
2.1 2 1 1.7673
3.1 3 1 0.7167
1.2 1 2 0.9102
2.2 2 2 0.3842
3.2 3 2 1.6822
1.3 1 3 -0.6357
2.3 2 3 -0.4616
3.3 3 3 1.4323
1.4 1 4 -0.6507
2.4 2 4 -0.2074
3.4 3 4 -0.3928

The function split
If, for example, a separate data.frame is desired for each level of a factor, the function split can be used. It however saves the result in a list and the individual data frames must be extracted from there.

> dataF1 <- data.frame(V1 = 1:3, V2 = rnorm(3),


+ V3 = factor(c("a", "b", "a")))
> splitF1 <- split(dataF1, dataF1$V3)
> splitF1
$a
V1 V2 V3
1 1 -0.3200 a
3 3 0.4942 a

$b
V1 V2 V3
2 2 -0.2791 b

The function subset


When selecting observations depending on some logical expression the function subset is very useful.

> library(MASS)
> data("crabs", package = "MASS")
> subset(crabs, RW >= 15.3 & sp == "B",
+ select = FL:BD)
FL RW CL CW BD
44 18.8 15.8 42.1 49.0 17.8
47 19.7 15.3 41.9 48.5 17.8
50 21.3 15.7 47.1 54.6 20.0
97 16.7 16.1 36.6 41.9 15.4
98 17.4 16.9 38.2 44.1 16.6
99 17.5 16.7 38.6 44.5 17.0
100 19.2 16.5 40.9 47.9 18.1

The function transform


A possibility to transform permanently variables in a data frame is the function transform

> dataF1 <- data.frame(V1 = 1:3, V2 = rnorm(3))


> str(dataF1)
'data.frame': 3 obs. of 2 variables:
$ V1: int 1 2 3
$ V2: num -0.177 -0.506 1.343
> Vnew <- factor(c(1,2,1))
> dataF1 <- transform(dataF1, new = Vnew,
+ V1 = log(V1))
> str(dataF1)
'data.frame': 3 obs. of 3 variables:
$ V1 : num 0 0.693 1.099
$ V2 : num -0.177 -0.506 1.343
$ new: Factor w/ 2 levels "1","2": 1 2 1

The functions fix and edit


• In R there are two simple functions to edit datasets. The functions are edit and fix.

• The difference between the two is that the changes made with edit have to be stored in a new dataset while fix can overwrite the current dataset.
• Both functions will open a new window where you can edit single cells, change variables names or determine
the type of the variable.

> crabs2 <- edit(crabs) ## do manual changes, then click on Quit


> fix(crabs) # after manual changes are done, crabs will be overwritten

Missing data
• In R, missing values are represented by the symbol NA (not available).
• Often the result of an operation in which NA occurs is also set to NA.
• Many functions and procedures have an argument for handling NAs (na.rm), which if it is set to TRUE excludes
the NA observations from the respective calculation.
• Note: This corresponds to the standard procedure of many statistics programs, but may lead to different
samples in the calculations.
• Most standard models cannot deal with missing values (exceptions: boosting, decision trees. . . ).
• In any case, missing values must be investigated before an analysis can be performed.
• Options:

– elimination (see above)


– replacement:
∗ by mean, mode, median
∗ last observation carried forward (for time series)
∗ estimation through regression
∗ ...

Missing data in R
• In R, the function is.na() can be used on a vector. matrix or data frame to check which elements are NA:

> x <- c(1, 2, 5, NA, 10, NA)


> is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE TRUE
> ddf <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
> is.na(ddf)
x y
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE TRUE

Missing data: Airquality example


• For a data frame it is important to know the percentage of missings in each column:

> data("airquality")
> colMeans(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
0.24183 0.04575 0.00000 0.00000 0.00000 0.00000

• Also, the function complete.cases() returns a logical vector which is TRUE for every row that contains no NAs

> # number of complete observations/rows
> sum(complete.cases(airquality))
[1] 111

Missing data elimination: Airquality example


• If a variable has many missings, it can be excluded from further analysis.
• However, typically incomplete observations are excluded. (default treatment applied by functions like lm()).

> airquality2 <-


+ airquality[complete.cases(airquality), ]
> ## or
> airquality2 <- na.omit(airquality)

Missing data replacement


• Sometimes it would make sense to assume that if a value is missing it can be replaced by e.g., zero. This
depends on the variable and needs understanding of the data.
• If we want to replace all missings in the data set with the same value e.g., 0 we can do this by:

> ddf <- data.frame(x = c(1, NA, 3), y = c(11, 10, NA))
> ddf
x y
1 1 11
2 NA 10
3 3 NA
> ddf[is.na(ddf)] <- 0
> ddf
x y
1 1 11
2 0 10
3 3 0

Missing data replacement by mean

> ddf <- data.frame(x = c(1, NA, 3), y = c(11, 10, NA))
> ddf$x[is.na(ddf$x)] <- mean(ddf$x, na.rm = TRUE)
> ddf$y[is.na(ddf$y)] <- mean(ddf$y, na.rm = TRUE)
> ddf
x y
1 1 11.0
2 2 10.0
3 3 10.5

> ## or with for loop...


> ddf <- data.frame(x = c(1, NA, 3), y = c(11, 10, NA))
> for (i in 1:ncol(ddf)) {
+ ddf[, i][is.na(ddf[, i])] <- mean(ddf[, i], na.rm = TRUE)
+ }
> ddf
x y
1 1 11.0
2 2 10.0
3 3 10.5

Question: Would you use the mean or the median for imputation for the airquality data? How could you decide?

Data preprocessing: Outliers


• Outliers in data can distort results of models which are not designed to deal with extreme values (i.e., non-
robust).
• Detection:

– univariate: observations that lie more than 1.5*IQR below the 25th or above the 75th quantile (the IQR, "inter-quartile range", is the difference between the 75th and 25th quantiles); in a boxplot these can be visualized as the points outside the whiskers.
– multivariate
∗ defined within the scope of a model (e.g., based on Cook’s distance, which we will encounter in the
regression chapter).
∗ observations which are anomalous based on all the variables under investigation (detected using
unsupervised learning algorithms for anomaly detection)

Outlier handling
• Elimination (not advised!)
• Imputation - same as missing values
• Capping - e.g., setting all values above (below) a certain quantile to the value of a quantile.
• Use methods in the statistical analysis which are robust to this issue.

Univariate outliers: Example

> vv <- airquality[, "Ozone"]


> bxp <- boxplot(vv)
> text(bxp$group, # the x locations
+ bxp$out, # the y values
+ rownames(airquality)[which(vv %in% bxp$out)],
+ pos = 4)

(Figure: boxplot of Ozone; the outliers are labeled with their row names, here observations 117 and 62.)

Outlier replacement by upper quantile


We can replace the outliers in the ozone variable by the 95% quantile by

> airquality$Ozone[which(airquality$Ozone %in% bxp$out)] <-
+ quantile(airquality$Ozone, 0.95, na.rm = TRUE)

Further R topics

Workspace and search path


• All objects created by the user during a session form the current workspace. Its content can be managed with functions like objects, ls and rm.
• Functions and other R objects provided by base R or packages are not in the workspace.
• If one calls an object in R, R will look for it first in the workspace and, if it cannot find it there, in the next environment and so on.
• The order in which R searches for objects is called the search path.
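
For example:

> x <- 1:10
> ls()  # list the objects in the current workspace
> rm(x) # remove x from the workspace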

Workspace and search path II


The current search path during a session can be obtained submitting search() and looks for example like

> search()
[1] ".GlobalEnv" "whiteside" "package:MASS"
[4] "package:stats" "package:graphics" "package:grDevices"
[7] "package:utils" "package:datasets" "package:methods"
[10] "Autoloads" "package:base"

In this case .GlobalEnv corresponds to the workspace.

Packages for R
• R is open source software and users submit new functions all the time. These functions are normally submitted as packages.
• The base version of R however comes only with a few selected packages. Other packages must first be installed and the easiest way is to use the menu for it (see also the sketch at the end of this section).
• But even though packages are installed, they are still not available for the user at the beginning of an R session (besides a few basic packages which are loaded automatically). Add-on packages should be loaded by the user when they need them.
• Whether a package is loaded can be seen in the search path.
• Packages can be loaded using the menu or as

> library(foo)

• Sometimes it is also necessary to remove packages from the search path. This can be done by submitting

> detach("package:foo")
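
From the command line the typical workflow looks as follows (the package name is just an example; installation has to be done only once, loading in every session where the package is needed):

> install.packages("data.table") # install once (requires an internet connection)
> library(data.table)            # load it for the current session
> detach("package:data.table")   # remove it from the search path again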

Citing R
• R comes for free and a lot of people contribute to it. They don’t want any money from you when you use it,
they however would like to be acknowledged when you are using their work.
• Therefore it is appreciated if you cite R and special packages when you use them for your work. If you want
to know how R or packages want to be cited, use the function citation.
• For R in general:

> citation()

To cite R in publications use:

R Core Team (2021). R: A language and environment for statistical


computing. R Foundation for Statistical Computing, Vienna, Austria.
URL https://www.R-project.org/.

A BibTeX entry for LaTeX users is

@Manual{,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2021},
url = {https://www.R-project.org/},
}

We have invested a lot of time and effort in creating R, please cite it


when using it for data analysis. See also 'citation("pkgname")' for
citing R packages.

• For R packages:

> citation("MASS") # for citing packages, in this case the package MASS

To cite the MASS package in publications use:

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with


S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

A BibTeX entry for LaTeX users is

@Book{,
title = {Modern Applied Statistics with S},
author = {W. N. Venables and B. D. Ripley},
publisher = {Springer},
edition = {Fourth},
address = {New York},
year = {2002},
note = {ISBN 0-387-95457-0},
url = {https://www.stats.ox.ac.uk/pub/MASS4/},
}

Saving the workspace


• The entire workspace can be saved at any time during a session using the menu or the command save.image().
By default, save.image() will save the workspace as a .RData file in the working directory. (See the help of
the function if you want to save it somewhere else.)

• Unless otherwise specified in global settings, R will ask before it is closed whether the current workspace should be saved. In that case it will load the saved workspace at the start of the next session.
• Saving the whole workspace is typically not recommended. See e.g., discussion here.

Saving only parts of the workspace


• That R keeps every object created in the workspace has advantages and disadvantages.
• The advantage is that everything is at hand and quickly available. The main disadvantage is that it is very demanding on memory, which will be a problem for analyses of large data sets.
• Often also only a few objects of a workspace are really of interest and worth saving (maybe output of some
computationally intensive models).

> save(foo1, foo2, file = "foo.RData")

• Objects saved on previous sessions can be loaded into new session using load.

> load(file = "foo.RData")

Note: This will save the file in the working directory.

Working directory
• The working directory is the path where R will search by default for files to read or where R will by default
save files.
• The current working directory can be obtained or changed using functions getwd() and setwd() or the menu.

> getwd()
[1] "/Users/lauravanagur/Documents/Teaching/CompStat"
> # try, but does not work in Rmarkdown
> # setwd("/Users/lauravanagur/Documents/")
> # getwd()

• When opening a file with RStudio, it automatically sets the working directory to the location of the file. In
Rmarkdown it is automatically the location of the .Rmd.

Relative vs. absolute paths


• Paths can be specified in an absolute way or in a relative way (i.e., relative to the working directory).
• Relative paths are useful for reproducibility. They ensure that if another user has the same folder structure,
they can use the same paths.
• Assume I want to read a data set dat.csv from the folder Practicals/Datasets which is in my working
directory:

> ## getwd()
> ## "/Users/lauravanagur/Documents/Teaching/CompStat/Slides"
> dat <- read.csv("Practicals/Datasets/dat.csv")

Scripts
• Scripts written in editors are usually saved in files with ending .r or .R.
• These files can be loaded from within R.
• To load a whole script the function source() is used.

> ## again, with relative path...
> source("Rscript.R")

• The source command will by default create all objects that are defined in the file but produces no output.
Output will only be produced if an object in the file is explicitly printed using the print function.

The working history


• If one is not interested in the objects created during a session but in all the commands used during it, then the
history is of interest.
• The easiest way to save the history and reload an old history is to use the menu. But of course it can also be
done using functions.
• The corresponding functions are savehistory and loadhistory.
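A minimal sketch (the file name mysession.Rhistory is just an example; the history functions are only available in interactive sessions and depend on the front end):

> savehistory(file = "mysession.Rhistory")  # write all commands of the current session to a file
> loadhistory(file = "mysession.Rhistory")  # restore them in a later session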

Dates and times in R


• Dates and times are among the most complicated types to work with on computers.

– The standard calendar is complicated (leap years, months of different lengths, historically different calendars
- Julian vs. Gregorian).
– Times depend on an unstated time zone (add daylight saving time :-() and some years have leap seconds to
keep the clocks consistent with the rotation of the earth!

• R can flexibly handle dates and times and has different classes for them with different complexity levels.
• Most classes offer then also arithmetic functions and other tools to work with date and time objects.
• A good overview over the different classes is given in the Helpdesk section of the R News 4(1).
• The builtin as.Date() function handles dates (without times).
• The contributed library chron handles dates and times, but does not control for time zones.
• The POSIXct and POSIXlt classes allow for dates and times with control for time zones.
• The various as. functions can be used for converting strings or among the different date types when necessary.
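As a small illustration of the two POSIX classes (the date-time string and the time zone are arbitrary examples):

> x <- as.POSIXct("2022-01-20 14:30:00", tz = "Europe/Vienna")
> as.numeric(x)      # internally: seconds since 1970-01-01 00:00:00 UTC
> y <- as.POSIXlt(x) # internally: a list of components
> y$hour             # 14
> y$mon              # 0, months are counted starting from 0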

The Date class


• Objects of class Date are obtained by the as.Date() function; they can represent years since 1 AD and are
normally created by transforming a character vector which contains information about the day, month and
year.
• Note that the behavior of the function when some of the information is missing depends on your system! Also,
whether some of the formats can be read depends on your locale.
• To get the current date:

> Sys.Date()
[1] "2022-01-20"

as.Date() function
• The as.Date() function allows for a variety of formats through the format= argument.

Code Value
%d Day of the month (decimal number)
%m Month (decimal number)
%b Month (abbreviated)
%B Month (full name)
%y Year (2 digit)
%Y Year (4 digit)
%C Century

Note: %y is system dependent so must be employed with care.

The Date class example

> x <- c("02.06.1987","6/21/98","31may1960")


> as.Date(x[1], format = "%d.%m.%Y")
[1] "1987-06-02"
> as.Date(x[2], format = "%m/%d/%y")
[1] "1998-06-21"
> as.Date(x[3], format = "%d%b%Y")
[1] "1960-05-31"
> lct <- Sys.getlocale("LC_TIME")
> lct # my locale is in English
[1] "en_US.UTF-8"
> Sys.setlocale("LC_TIME", "de_DE.UTF-8") # different for different OSs
[1] "de_DE.UTF-8"
> as.Date(x[3], format = "%d%b%Y") # may doesn't work for German...
[1] NA
> Sys.setlocale("LC_TIME", lct) # reset to English
[1] "en_US.UTF-8"

Internal storage of dates and times


• Except for the POSIXlt class, dates are stored internally as the number of days or seconds from some reference
date.
• Thus dates in R will generally have a numeric mode, and the class function can be used to find the way they
are actually being stored.
• The POSIXlt class stores date/time values as a list of components (hour, min, sec, mon, etc.) making it easy
to extract these parts.
• E.g., internally, Date objects are stored as the number of days since January 1, 1970, using negative numbers
for earlier dates.

> y <- as.Date(x[3], format = "%d%b%Y")


> y
[1] "1960-05-31"
> as.numeric(y)
[1] -3502

Finnish social security number


Every Finn and foreigner working in Finland gets a Finnish social security number. The number is used as a personal
identifier and therefore unique for each individual.
The structure of this number is: DDMMYY C ZZZQ where

• DDMMYY gives the date of birth.
• C specifies the century of birth. + = 19th Cent., - = 20th Cent. and A = 21st Cent.
• ZZZ is the personal identification number. It is even for females and odd for males
• Q is a control number or letter to see if the total number is correct.

Extracting information from Finnish ids


We know now that the Finnish social security number contains a lot of useful information. We will discuss now, how
to extract the birthday and sex of an individual from their id using R.
The following functions will be needed for this task:

• substr extracts a substring from a character vector.


• paste can be used to collapse elements of different vectors to one.
• ifelse does a vectorized evaluation of an expression. ifelse(expression, A, B), so if expression is true, the
result will be A and if false B.
• %in% is a logical operator which returns TRUE for each element of its left-hand side that has a match in its
right-hand side, and FALSE otherwise.

We will use three Finnish fake ids

> x <- c("010199-123N", "301001A1234","130620-4567")


> xDates <- substr(x, 1, 6)
> xSex <- substr(x, 10, 10)
> centuries <- ifelse(substr(x, 7, 7) == "+", 19,
+ ifelse(substr(x, 7, 7) == "-", 20, 21))
> x2Dates <- paste(xDates, centuries - 1, sep = "")
> birthDates1 <- as.Date(xDates, format = "%d%m%y") # wrong
> birthDates2 <- as.Date(x2Dates, format = "%d%m%y%C")
> sex <- ifelse(xSex %in% c(0, 2, 4, 6, 8),"Female","Male")
> cbind.data.frame(sex, birthDates1, birthDates2)
sex birthDates1 birthDates2
1 Male 1999-01-01 1999-01-01
2 Male 2001-10-30 2001-10-30
3 Female 2020-06-13 1920-06-13

Debugging
There are two commonly cited claims:

1. Programmers spend more time debugging their own code than actually writing it.
2. Every 20 lines of code contain at least one bug.

Hence debugging is an essential part of programming, and R provides strategies and tools to do this well.

Top-down programming
• There is general agreement that good code is written in a modular manner. This means that when you have a procedure
to implement, you decompose it into small parts, where each part becomes its own function.
• Then the main function is “short” and will consist mainly of calling these subfunctions.
• Naturally, the same approach is also taken within these subfunctions.
• The same approach is followed in debugging. First the top-level function is debugged and all subfunctions are
assumed correct. If this does not yield a solution, then the next level is debugged and so on.

Small start strategy
• The small start strategy in debugging suggests starting with small test cases.
• Once these work fine, then consider larger testing cases.
• At that stage also extreme cases should be tested.

Antibugging
• Also some antibugging strategies are useful in this context.
• Assume that at line n in your code you know that a variable or vector x must have some specific property, like
being positive or summing up to 1.
• Then you can add in that line in the code for debugging purposes for example

> stopifnot(x > 0)

or

> stopifnot(sum(x) == 1)

• This might help to narrow down where the bug occurred.

R functions for debugging


R provides many functions to help in the debugging process. To name some:

• browser
• debug and undebug
• debugger
• dump.frames
• recover
• trace and untrace

For details about these functions see their help pages. In the following we will look only at debug and traceback.
Note that also Rstudio offers special debugging tools, see https://support.rstudio.com/hc/en-us/articles/205612627-
Debugging-with-RStudio for details.

traceback
• Often when an error occurs while using functions it is not really clear where the error actually occurs, i.e.,
which (sub)function caused it.
• One strategy then is to use the traceback function which, when called directly after the erroneous call, returns
the sequence of function calls which led to the error.

traceback II

> f1 <- function(x) f2(x)ˆ2
> f2 <- function(x) log(x) + "x"
> mainf <- function(x) {
+ x <- f1(x)
+ y <- mean(x)
+ y
+ }
> mainf(1:3)
> traceback()

Error in log(x) + "x": non-numeric argument to binary operator


3: mainf(1:3)
2: f1(x)
1: f2(x)

debug
Assume you have a function foo which you suspect to be faulty. Then calling

> debug(foo)

will open the “browser” whenever the function is called, until either the function is changed or the debugging mode
is terminated using

> undebug(foo)

In the “browser” the function is executed line by line, where the next line to be executed is always shown.

debug commands in browser mode


In the browsing mode the following commands have a special meaning:

• n (or just hitting enter) will execute the line shown and then present the next line to be executed.
• c this is almost like n just that it might execute several lines of code at once. For example if you are in a loop
then c will jump to the next iteration of the loop.
• where this prints a stack trace, the sequence of function calls which led the execution to the current location
• Q this quits the browser.

In browser mode any other R command can be used as well. However, to see for example the value of a variable
named n, the variable needs to be explicitly printed using print(n), because n on its own is interpreted as the browser command.

Debugging demo
In a demo we will go through the following function in debugging mode

> SimuMeans <- function(m, n = 100, seed = 1) {


+ set.seed(seed)
+ RES <- matrix(0, nrow = m, ncol = 3)
+ for (i in 1:m){
+ X <- cbind(rnorm(n), rt(n,2), rexp(n))
+ for (j in 1:3){
+ RES[i,j] <- mean(X[,j])

+ }
+ print(paste(i,Sys.time()))
+ }
+ return(RES)
+ }
> debug(SimuMeans)
> SimuMeans(5)

Capturing errors
• Especially in simulations it is often desired that, when an error occurs, not the whole process is terminated,
but the error is caught and an appropriate record made, while otherwise the simulations continue.
• R has for this purpose the functions try and tryCatch; here we will consider only tryCatch.
• The idea of tryCatch is to run the “risky” part where errors might occur within the tryCatch call and to tell
tryCatch what to return in the case of an error.

Capturing errors demo


Consider a modified version of our previous simulation function:

> my.mean <- function(x){


+ na.fail(x)
+ mean(x)
+ }
> SimuMeans2 <- function(m, n=100, seed=1) {
+ set.seed(seed)
+ RES <- matrix(0, nrow=m, ncol=3)
+ for (i in 1:m){
+ X <- cbind(rnorm(n), rt(n,2), rexp(n))
+ if (i==3) X[1,1] <- NA
+ for (j in 1:3){
+ RES[i,j] <- my.mean(X[,j])
+ }
+ }
+ return(RES)
+ }
> SimuMeans2(5)
Error in na.fail.default(x): missing values in object

Capturing errors demo II


Using tryCatch

> SimuMeans3 <- function(m, n=100, seed=1) {


+ set.seed(seed)
+ RES <- matrix(0, nrow=m, ncol=3)
+ for (i in 1:m){
+ X <- cbind(rnorm(n), rt(n,2), rexp(n))
+ if (i==3) X[1,1] <- NA
+ for (j in 1:3){
+ RES[i,j] <- tryCatch(my.mean(X[,j]), error = function(e) NA)
+ }
+ }
+ return(RES)

+ }
> SimuMeans3(5)
[,1] [,2] [,3]
[1,] 0.10889 -0.29099 1.1103
[2,] -0.04921 -0.17200 0.8624
[3,] NA -0.02305 1.0302
[4,] -0.09209 -0.27303 1.0814
[5,] -0.05374 0.13526 1.0200

Profiling
• If you know that your function is correct but think it is slow you can do profiling, which helps to identify the
parts of the function which are bottlenecks; you can then consider whether these parts could be improved.
• The idea in profiling is that the software checks in very short intervals which function is currently running.
• The main functions in R to do profiling are Rprof and summaryRprof. But there are also many other specialized
packages for this purpose.

A function to profile

> Stest <- function(n = 1000000, seed = 1) {


+ set.seed(seed)
+ normals <- rnorm(n*10)
+ X <- matrix(normals, nrow=10)
+ Y <- matrix(normals, ncol=10)
+ XXt <- X %*% t(X)
+ XXcp <- tcrossprod(X)
+ return(n)
+ }
> system.time(Stest())
user system elapsed
0.621 0.033 0.664

A function to profile II

> Rprof(interval = 0.01)


> Stest()
[1] 1e+06
> Rprof(NULL)
> summaryRprof()$by.self
self.time self.pct total.time total.pct
"rnorm" 0.25 51.02 0.25 51.02
"%*%" 0.11 22.45 0.11 22.45
"tcrossprod" 0.06 12.24 0.06 12.24
"matrix" 0.05 10.20 0.05 10.20
"t.default" 0.02 4.08 0.02 4.08

Run it on your own computer and look at the full output of summaryRprof().

Package microbenchmark
• The contributed package microbenchmark is useful in comparing the speed of different functions.

• The microbenchmark() function serves as a more accurate replacement of the often seen system.time().

> if (!require("microbenchmark")) install.packages("microbenchmark")


Loading required package: microbenchmark
> library("microbenchmark")
> f1 <- function(X) t(X) %*% X
> f2 <- function(X) crossprod(X)
> X <- matrix(rnorm(2000000 * 3), ncol = 3)
> microbenchmark(f1(X), f2(X), times = 10L)
Warning in microbenchmark(f1(X), f2(X), times = 10L): less accurate nanosecond
times to avoid potential integer overflows
Unit: milliseconds
expr min lq mean median uq max neval cld
f1(X) 51.02 52.33 53.88 52.90 54.89 60.97 10 b
f2(X) 17.35 17.45 17.84 17.65 18.08 18.97 10 a

Introduction to regression modeling in R

Regression modeling in R
The following chapter gives a small glimpse of the linear regression model in R.
There are many options (functions) in R available for other regression models (e.g., generalized linear models, pe-
nalized regression models etc.). We focus here only on basic linear regression. But many principles apply also when
using functions for other regression models.
Here some useful functions and packages for regressions in R:
aov ANOVA models in R
lm linear regression
glm generalized linear models like logistic regression
nls nonlinear regression
nlme package for linear and nonlinear mixed effect models
lme4 package for linear and generalized linear mixed
effect models
survival package for parametric and nonparametric survival
models

Generic functions for regression models


Almost all regression functions are called using a formula and it is common practice to assign the result to an object
like:

> foo.reg <- foo(model.formula, data = data)

This object is usually quite complex and printing it returns only minimal output. A lot of generic functions have
however methods for the different regression models. Some important ones are:

summary most important output


anova ANOVA table
fitted fitted values of the model
predict can be used to predict new observations
resid residuals
plot diagnostic plots
coef estimated parameters
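
After fitting, these generics can be applied directly to the result; a small sketch, where foo.reg stands for any fitted regression object and data.new for a data frame with new observations:

> summary(foo.reg)                      # main output of the fit
> coef(foo.reg)                         # estimated parameters
> fitted(foo.reg)                       # fitted values
> resid(foo.reg)                        # residuals
> predict(foo.reg, newdata = data.new)  # predictions for new observations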

Linear model
The linear model assumes that the relationship between the response variable (aka dependent variable, output) Y
and p independent variables (aka explanatory variables, predictors, covariates, features) X1 , . . . , Xp is linear and can
be represented as:
Y = β0 + β1 X1 + . . . + βp Xp + ϵ,
where β0 is the model constant or intercept, βj is the regression coefficient corresponding to the variable Xj and ϵ is
a random error term which captures variation in Y not explained by X1 , . . . , Xp .
The model is linear in the unknown parameters βj , j = 0, . . . , p.
The variables X1 , . . . , Xp can come from different sources:

• quantitative inputs,
• transformations of quantitative inputs such as the log, square root, square,
• basis expansions e.g., X2 = X12 , X3 = X13 . . .
• numeric or dummy coding of the levels of qualitative inputs,
• interactions between variables: X3 = X1 · X2 .

For a (training) sample of n observations, we have:


yi = β0 + β1 xi1 + . . . + βp xip + ϵi ,   i = 1, . . . , n
or in matrix notation
y = Xβ + ϵ,

where

• y = (y1 , . . . , yn ),
• X = (1, x1 , . . . , xp ) is an n × (p + 1) matrix of independent variables (including a vector of ones corresponding
to the intercept),
• β = (β0 , β1 , . . . , βp ) is the (p + 1) × 1 vector of regression coefficients (with intercept) and
• ϵ = (ϵ1 , . . . , ϵn ).

Linear model assumptions


We can state the model assumptions as follows:
(A1): X is nonstochastic, X has full rank.
(A2): E(ϵ) = 0 and cov(ϵi , ϵj ) = 0 for i ≠ j.
(A3): Homoscedasticity: cov(ϵ) = σ 2 In .
(A4): Normality: ϵ ∼ Nn (0, σ 2 In ).
And naturally all the assumptions should be checked.
Assumptions (A1) and (A2) are rather checked by investigating the design of the experiment whereas (A3) and (A4)
are based on analysis of the residuals.

Estimation using OLS


• Ordinary least squares (OLS) can be used in order to estimate the unknown parameters β.
• The OLS estimators are obtained by minimizing the residual sum of squares:
RSS(β) = (y − Xβ)⊤ (y − Xβ)
• The solution is obtained by differentiating this quadratic function with respect to β and setting the first
derivative to zero:
β̂ = (X ⊤ X)−1 X ⊤ y
Note: Here we use the full rank assumption (A1), which ensures that there is a unique solution for the unknown
parameters.
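As a small numerical sketch, the closed-form solution can be compared with the output of lm() on simulated data (all object names below are made up for the example):

> set.seed(1)
> n <- 100
> X <- cbind(1, x1 = rnorm(n), x2 = runif(n))  # design matrix including a column of ones
> beta <- c(2, 1, -0.5)
> y <- X %*% beta + rnorm(n, sd = 0.5)
> solve(t(X) %*% X, t(X) %*% y)                # (X'X)^(-1) X'y computed directly
> coef(lm(y ~ X[, -1]))                        # the same estimates via lm()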

Conventions
Usually the following terms are used in a regression context:

• ŷ = Xβ̂ are the fitted values.


• r = y − ŷ are the residuals.
• r⊤ r is the residual sum of squares (RSS).
• H = X(X⊤ X)−1 X⊤ is the n × n “hat”-matrix which “puts the hat” on y. See ŷ = X(X⊤ X)−1 X⊤ y = Hy.
• hi = Hii is the ith diagonal value of H and called leverage. It holds that 0 ≤ hi ≤ 1 (as H is idempotent).

Leverage
The leverages hi are useful in identifying influential observations. We know that (for a model with intercept):
• ∑_{i=1}^{n} hi = ncol(X), where for a model with intercept ncol(X) = p + 1,
• in a model with intercept, hi ≥ 1/n.

This means if a leverage hi is large, it must be due to extreme values in xi .


A rule of thumb says, for example, that a leverage point with leverage larger than 2·ncol(X)/n should be investigated
more closely.
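
In R the leverages of a fitted lm object can be extracted with hatvalues(); a small sketch, where fit stands for any fitted lm object:

> h <- hatvalues(fit)                            # the leverages h_i
> sum(h)                                         # equals the number of columns of the design matrix
> which(h > 2 * length(coef(fit)) / nobs(fit))   # rule-of-thumb flag for large leverages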

Sampling properties of OLS estimators I


• Assuming (A2) and (A3), the variance-covariance matrix of the OLS estimator is:

var(β̂) = σ² (X⊤X)⁻¹.

• The standard deviation of the errors is typically estimated by

σ̂ = √( RSS / (n − p − 1) )

• Under linearity and (A2)+(A4), we have

β̂ ∼ N(β, σ² (X⊤X)⁻¹)

• Also,

(n − p − 1) σ̂² ∼ σ² χ²_{n−p−1}

Sampling properties of OLS estimators II


• We use the distributional properties to form tests of hypothesis and confidence intervals for the parameters βj .
• To test the hypothesis that a particular coefficient βj = 0, we form the standardized coefficient which follows
a tn−p−1 distribution under the null hypothesis:

t = β̂j / (σ̂ √vj) ∼ t_{n−p−1} ,   where vj is the jth diagonal element of (X⊤X)⁻¹

• To check the significance of groups of coefficients simultaneously we can use the F -statistic which has an F
distribution under the null:
F = [(RSS0 − RSS1)/(p1 − p0)] / [RSS1/(n − p1 − 1)] ∼ F_{p1−p0, n−p1−1}
where RSS0 is the RSS of the smaller model with p0 variables and RSS1 is the RSS of the larger model with
p1 variables.

Goodness of fit
• Before discussing the residual analysis, we recall few quantities which quantify the extent to which the model
fits the data.

– residual standard error σ̂, standard deviation of the residuals which is an estimate of standard deviation
of ϵ.
∗ Roughly speaking, it is the average amount that the response will deviate from the true regression
line.
∗ It is measured in the units of the response.
– R2 statistic (coefficient of determination), which represents the proportion of variability in Y that can be
explained by the linear regression. Note that it always increases as we add more predictors to the model.
R² = 1 − RSS/TSS ,   where TSS = ∑_{i=1}^{n} (yi − ȳ)²

• Note that these measures only apply to the linear regression case and do not easily extend to other types of
regression.

Residuals
The realizations of the random term ϵi are not observable. Therefore we use instead the residuals ri as an estimate.
Residuals are useful to evaluate the goodness of fit of the model and to check the model assumptions but they have
some design limitations since they must fulfill
∑_{i=1}^{n} ri = 0   and   X⊤r = 0

Furthermore, residuals do not have the same variance by construction. Their variance decreases as the x values move
further away from the average x value:

var(r) = var(y − ŷ) = var((In − H)y) = (In − H)² σ² = (In − H) σ²

var(ri) = σ² (1 − hi)

There are two ways to standardize residuals to make them more useful for model diagnostics.

Standardized residuals
The standardized residuals are rescaled to have equal variances. They are computed using the leverages.
r̃i = ri / (σ̂ √(1 − hi))

However, if σ̂² is a bad estimate of the model variance σ² (the variance is not robust, so this happens for instance
if one residual is very large), the inflated σ̂² will flatten the standardized residuals.

Studentized Residuals
A way to get “good” residuals when there is one bad data point is to see, what would happen if we dropped one
observation and use only the remaining n − 1 ones for the estimation. With that we predict the value of the omitted
value and can get so the so-called studentized residuals.
ři = (yi − ŷ(i)) / √( var(yi − ŷ(i)) ) ,

where yi is the omitted observation and ŷ(i) the prediction of yi based on a model that was fitted after excluding the
ith observation.
Note: The terminology for residuals is not everywhere the same, therefore check always carefully which definition
your software package uses.

Residual analysis
The following plots can be useful for the evaluating the model assumptions:

• residuals versus fitted values


• residuals versus predictors in the model
• residuals versus predictors not in (or deleted from) the model
• qqplot of residuals

Note that model assumptions are usually checked rather visually and not by testing.

Residuals vs. fitted values


• This is the most important residual plot.
• The scatterplot should be centered around 0 and show a constant variance.
• There should be no structure in the plot. A structure would suggest non-linearity or unequal variances.
• Sometimes the root of the standardized residuals is plotted against the fitted values. This should make it easier
to detect a trend in the dispersion.

Residuals vs. predictors plot


• The plots residuals vs. predictors in model have more or less the same purpose as residuals vs. fitted values (in
the simple linear regression case with one variable they look the same).
• You should consider the same points as mentioned for the plot of residuals vs. fitted values.
• It is also useful to plot the residuals against variables not included in the model (if available) or variables
deleted from the model.
• Checking the residuals vs. predictors plots however gets interesting if we find there some of the characteristics
we want to avoid in the other plots. Here that could be an indicator that the omitted predictor should be
included in the model.

qqplot of residuals
• The qqplot allows us to check the assumption of normality. It is recommended to use for this purpose the
standardized residuals or the studentized residuals.
• The points should lie then on the bisector.

Outliers and influential observations


• An outlier in a regression model is an observation which has extreme values regarding its response.
• High leverage points are ones which have extreme values in terms of the x values.
• Outliers and high leverage points need not necessarily pose a problem for the model.
• An influential observation is an observation which would, when excluded from the model change the pa-
rameter estimates considerably.

Outliers, leverage and influential points
[Figure: four scatterplots of y versus x with fitted regression lines illustrating: the original data; an outlier with high leverage, low residual and no influence; a high-leverage point with a large residual and high influence; and an outlier with low leverage, a large residual and low influence.]

Example: the Anscombe quartet

[Figure: scatterplots of the four Anscombe data sets, y1 vs. x1, y2 vs. x2, y3 vs. x3 and y4 vs. x4.]

• The Anscombe quartet contains four data sets of 11 observations.


• Used to illustrate the importance of visualization and several violations of the linear model assumptions.
• The variables in the four data sets have similar summary statistics and the regression lines fitted by OLS to
the different data sets are almost identical.

> data("anscombe")
> colMeans(anscombe)
x1 x2 x3 x4 y1 y2 y3 y4
9.000 9.000 9.000 9.000 7.501 7.501 7.500 7.501
> apply(anscombe, 2, sd)
x1 x2 x3 x4 y1 y2 y3 y4
3.317 3.317 3.317 3.317 2.032 2.032 2.030 2.031

Anscombe quartet: residual vs fitted

[Figure: residuals versus fitted values for the regressions on the four Anscombe data sets (panels Anscombe 1-4).]

Anscombe quartet: qqplots of standardized residuals

[Figure: normal qqplots of the standardized residuals for the four Anscombe data sets (panels Anscombe 1-4).]

Design matrix
• As shown earlier, we assume we have a data matrix X which contains the explanatory variables.
• However the data matrix containing the variables X1 , . . . , Xp is usually not the matrix which we use in the
formulas earlier, but here X denotes the model or design matrix based upon the explanatory variables.
• For example the model matrix has usually a column of 1’s to model an intercept term.
• In the following slides we will discuss the forms explanatory variables can have to enter the model matrix. It
is important however that the design matrix has always full rank.

Continuous variables in the design matrix


• Continuous variables can enter straight as they are into the design matrix.
• Sometimes, however, some transformations might be reasonable. Assume we have a quantitative variable xi; a
change of scale like x̃i = (xi − a)/b might be of interest because

– predictors of similar magnitude are easier to compare


– might be easier to interpret
– numerical stability is increased when predictors have a similar scale

• Example: when centering the predictors, the intercept can be interpreted as the expected value of Y for average
values of the original predictors. Can be useful in some applications such as predicting house prices using m2
and number of bedrooms.

Continuous variables and collinearity
• A continuous variable need not necessarily enter linearly into the model. We can use transformations or add it
as a polynomial of higher order into the model.
• Adding polynomials, however, should be done with care because they correlate with each other and can cause
problems when estimating parameters (collinearity → X⊤X gets close to being non-invertible).
• If the predictors show large amounts of correlation, either pairwise elimination can be employed or a principal
component analysis could be made and the principal components used instead of the actual variables.
• Ideally, for the ceteris paribus interpretation to hold, the predictors should be independent. This is rarely the
case in practice. If the predictors are independent, then the coefficients of the individual simple linear regressions
are the same as the ones from the multiple linear regression.

Categorical variables in the design matrix


• A categorical variable (aka factor) with L levels is normally represented with l = L − 1 linearly independent
columns in the design matrix (often called dummy variables).
• These columns can be parameterized in several different ways. The parametrization is usually called a contrast.
• Different contrasts allow for different interpretations of the parameters. Possible contrasts are:

– treatment contrast
– sum contrasts
– helmert contrast
– polynomial contrast

• For more details see this vignette.

Treatment contrast I
• The treatment contrast is one of the most frequently used contrasts. The contrast has L − 1 columns.
• Assume we would have a factor with L = 4 levels, then the three columns would look like shown in the table.

Level [,1] [,2] [,3]


1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1

• Assume x is the original categorical variable in the design matrix, this implies that we create l = 3 columns:

dij = 1 if xi = j + 1 and 0 otherwise,   j = 1, . . . , L − 1

Treatment contrast II
• The regression model would be

yi = β0 + β1 di1 + β2 di2 + β3 di3 + . . . + ϵi

• The interpretation for the coefficients of the dummies of levels 2-4 would then be the difference in the expected
response with respect to level 1 (assuming all other variables are 0).
• β0 + βj gives the expected response for group j.
• The effect of the first level could then be associated with the intercept β0 .

Sum contrast I
• The sum contrast is a popular contrast for balanced experimental designs.
• All columns in the contrast have to add up to 0.
Level [,1] [,2] [,3]
1 1 0 0
2 0 1 0
3 0 0 1
4 −1 −1 −1
• Assume x is the original categorical variable in the design matrix, this implies that we create l = 3 columns:

dij = 1 if xi = j, −1 if xi = L, and 0 otherwise,   j = 1, . . . , L − 1

Sum contrast II
• The regression model would be
yi = β0 + β1 di1 + β2 di2 + β3 di3 + . . . + ϵi
• The interpretation for the coefficients of the dummies would then be the difference in the expected response
for level or group j with respect to the overall mean (assuming all other variables are 0).
• The intercept β0 has the interpretation of the overall expected value of the response when the predictors are
set to zero.
• β0 + βj gives the expected response for group j for j = 1, . . . , L − 1; for the Lth group the expected response is
β0 − ∑_{j=1}^{L−1} βj.

Helmert contrast
• The helmert contrast is a popular contrast (for instance default in S-Plus).

Level [,1] [,2] [,3]


1 −1 −1 −1
2 1 −1 −1
3 0 2 −1
4 0 0 3

• The first coefficient is the mean of the first two effects minus the first effect; the second coefficient is the mean
of all three effects minus the mean of the first two levels (parameter j compares the mean of effects for levels
1 : (j + 1) with the average of all effects for preceding factor 1 : j).
• It turns out the intercept is the mean of the means.

Polynomial contrast
• The polynomial contrast is recommended for ordered equidistant factors.
• It envisages the levels of the factor as corresponding to equally spaced values of an underlying continuous
covariate.
• It forces the effects to be monotonic in factor level order.
• It is, however, not that easy to interpret.

Level [,1] [,2] [,3]


1 −0.6708204 0.5 −0.2236068
2 −0.2236068 −0.5 0.6708204
3 0.2236068 −0.5 −0.6708204
4 0.6708204 0.5 0.2236068
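
The contrast matrices shown in the tables above can be generated directly in R (here for a factor with 4 levels):

> contr.treatment(4)
> contr.sum(4)
> contr.helmert(4)
> contr.poly(4)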

Interactions
• A basic model assumption is that the different variables have an additive effect on the response.
• However, this is not always the case and one way to include non-additive effects in linear models is by using
interactions.
• Usually only interactions between two variables at the time are considered. The interaction terms go into the
design matrix as products of the columns of the two variables concerned.
• The interpretation of the interactions depends on the variable types of the variables involved.

Interpreting interactions
Interactions between 2 factors: This is the simplest case. Here one has basically different levels for all possible
combinations of the levels of the original factors.
Interactions between factor and numeric variable: In this case the numeric variable still has a linear effect,
but now there is a different slope for each factor level.
Interactions between 2 numeric variables: This is a bit difficult to interpret. Basically, if one variable is kept
fixed, then the effect of the other variable is linear, with a slope that depends on the value at which the first
variable is kept fixed.
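
A small sketch of how interactions are specified in model formulas (dat, y, x, z and f are placeholder names for a data frame, a response, two numeric variables and a factor):

> lm(y ~ x * z, data = dat)        # main effects of x and z plus the interaction x:z
> lm(y ~ x + z + x:z, data = dat)  # equivalent explicit specification
> lm(y ~ f * x, data = dat)        # a separate slope of x for each level of the factor f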

Model selection
When fitting a regression model the aim is normally to find the smallest set of predictors which still describe the
data adequately well. Several strategies are available for model selection. Often different methods lead to different
models!
But what should always be considered:

• if a predictor is in the model as a polynomial, lower order parts cannot be removed.


• if you have included interactions into the model, no predictor involved in the interaction can be removed.
• the model should always have fewer predictors than there are observations.

There are no routine statistical questions, only questionable statistical routines. (D.R. Cox)

Backward selection

This method is rather simple and starts with all predictors in the model. Then we choose a “p-to-remove” level α.
Here α does not necessarily have to be 0.05. Often a larger α like 0.1 or 0.15 is chosen.
The method works the following way:

• fit the model with all predictors,


• remove the predictor with the highest p-value larger than α,
• refit the model and repeat the last step.

The final model has all predictors with a p-value smaller than α.

Forward selection

The forward selection method is just the opposite of the backward selection. It starts with an empty model and adds
predictors to the model as long as one of the remaining predictors has a p-value smaller than the “p-to-add” level α.
Again α is rather 0.1 or 0.15 than 0.05.
The method works the following way:

• compute all linear models containing a single predictor;

• choose the model which has the smallest p-value for the predictor which is smaller than α;
• fit all the models with the chosen predictor and one of the remaining predictors, keep again the one, which has
the smallest p-value smaller than α;
• continue until no predictor can be added anymore.

Stepwise selection

• The stepwise selection is a combination of the backward and the forward selection methods.
• It starts with the backward selection. But always after we delete a predictor from the model we check using
the forward method if we could add one of the other deleted predictors again to the model (we can add only
one of those not deleted in the last step).
• After adding or not adding one, we continue with the backward selection until we cannot add or remove anymore
any variable.

Comments on model selection

The selection methods described above are easy to implement but have some drawbacks.

• because of the one-at-a-time scheme, the optimal model can be missed
• there is a multiple comparison problem; especially when prediction is of interest, the stepwise procedure tends to
choose too “small” models
• one should still think if one of the excluded variables has a causal relationship and should therefore remain in
the model

Deviance, AIC and BIC

• In general one can say that models with more parameters will fit the data better.
• Therefore criteria are available which “punish” the number of predictors added. Let's assume we have p predictors in the model.

– The deviance of a regression model is:

Deviance = −2 · log-likelihood

– The Akaike Information Criterion (AIC) is then defined as:

AIC = Deviance + 2p

– The Bayes Information Criterion (BIC) is defined as:

BIC = Deviance + p · log(n)

• When we compare now models we prefer models with a smaller Information Criterion (higher log likelihood).
• These information criteria can also be used to substitute the p-values of the model selection methods. This
avoids for instance the multiple comparison problem.
• These criteria can also be compared when different distributional assumptions are made as long as they are
based on the same number of observations.
• Note: In the linear regression case, one can also use the adjusted coefficient of determination R²A to compare
models:

R²A = 1 − (1 − R²)(n − 1)/(n − p − 1).
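
In R these criteria can be computed with the functions AIC(), BIC() and logLik(); a sketch, where fit1 and fit2 stand for two models fitted to the same data:

> AIC(fit1, fit2)  # data frame with df and AIC for both models
> BIC(fit1, fit2)
> logLik(fit1)     # the underlying log-likelihood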

Linear regression in R
The lm function

The function lm is the function for the basic linear model. Its usage is

lm(formula, data, subset, weights, na.action,


method = "qr", model = TRUE, x = FALSE, y = FALSE,
qr = TRUE, singular.ok = TRUE, contrasts = NULL,
offset, ...)

If we have assigned a lm function call to an object we can directly extract from there many results using indexing.
E.g. coefficients, residuals, fitted.values, rank, weights, df.residual, call, terms, contrasts, xlevels, y.
But often the same with more options can be obtained using generic functions.

lm regression objects

• Assume we fitted with an appropriate model formula a regression model using the function lm and assigned
that to the object lm.out.
• Then a lot of functions have a generic output when applied to this object. What exactly these functions are
doing can be explored using the help pages.
• If we are for example interested to know what summary does to an lm object, we can ask the help for this by
using ?summary.lm.
• In general, for any generic function the specific help can be obtained this way.
• If we just ask for the lm.out object we get only minimal output. That is the model formula and the estimated
parameters.

Function update

• After creating a regression object one often wants to make only a small change, like changing the contrast or
removing or adding a variable.
• One could of course then just call the regression function again and make the changes there, but one could also
use the function update. This function applies to the old object the change which we defined in the update
function.
• Using for example +/- we could add or remove independent variables to / from the model.
• Assume lm.out contains the independent variables x1 and x2.
> ## add x3
> lm.out.add <- update(lm.out, . ~ . + x3)

> # eliminate x2
> lm.out.minus <- update(lm.out, . ~ . - x2)

Generic function summary

The summary of an lm object is normally the first you look at.


It provides you with:

• the model formula


• a 5-point statistic for the residuals
• the parameter estimates including their standard errors, t-test statistics and their p-values
• the residual standard error with its df (the residual standard error is the estimate of σ, the standard deviation
of the errors)
• R² and the adjusted R²

• the F-test of the full model against the intercept-only model

anova for one object

• In the case that we have only one lm object, the function anova returns an ANOVA table.
• This is however a sequential analysis of variance table for that fit.
– That is, the function returns a table which shows the reductions in the residual sum of squares as each
term of the formula is added in turn to the model, plus the residual sum of squares.
– The significance of this change is evaluated with an F-test.
– We start reading this table at the top.
• This means, that table says nothing about whether a variable belongs to the model, it makes only a statement
if the variable improved the fit when added to the model.
• The order how the model is specified matters here.
• For instance if we have the model formula y ~ x + z + w the ANOVA table would look different than when
you would have used y ~ w + z + x.
• For the first model, the last row of the ANOVA table would evaluate if a model with x, z and w is equal to a
model with only x and z. The next row then compares the model with x and z against the model with only x.

anova for several objects

• We call models nested when there is a “largest” model and all other models can be seen as subsets of this
“largest” model.
• If we now submit several nested lm objects to the anova function, the ANOVA table then compares
the different models.
• R however cannot make sure that the models are nested; it just makes that assumption. It is a kind
of convention to start the list with the largest model and then arrange them in descending order.
• Then again we can start our comparison in the last row and compare the results sequentially.

na.action

• Model comparisons based on likelihood tests make the assumption that the design matrix is always the “same”.
• This must be taken into account when the data has missing values.
• Normally, when there are missing values, we delete observations which have missing values in the independent
variables that are used in the current model.
• Therefore often smaller models have more observations than larger models.
• In R we can choose in lm between at least two different na.actions:
– na.omit drops incomplete observations entirely, so residuals and fitted values are only returned for the
complete cases
– na.exclude also drops them for the fit, but pads residuals and fitted values with NA, so that they remain comparable to the original data when missing values are at hand.

plot

As mentioned earlier, most of the model assumptions of regressions can be evaluated using plots.
R provides by default four diagnostic plots when an lm object is submitted to the plot function. Those plots
are:

• residuals vs. fitted


• qqplot for the standardized residuals
• root of standardized residuals vs. fitted values (useful for detecting heteroscedasticity).
• a plot of residuals against leverages

It is often easier to evaluate the fit when plotting all four plots into one window using the par() function.
Other plots can be obtained using the which argument. For details see ?plot.lm.

model.matrix

• If one is interested in how the design matrix looks, one can use the function model.matrix.
• This function returns for an lm object the design matrix where one for example can see which contrast was
used for a factor and so on.
• Especially when there are factors in your model it might be a good idea to check this matrix so that you know
how to interpret the result.

Contrasts in R

As mentioned earlier, factors need dummy variables when they enter a regression model. Depending on that coding,
the interpretation of the parameter estimates changes. Which types of contrasts R uses by default can be found out
using the command:

> getOption("contrasts")
unordered ordered
"contr.treatment" "contr.poly"

There one can see what R uses as default contrasts for unordered factors and ordered factors.
The contrasts discussed earlier have in R the following names:

• treatment contrast: contr.treatment


• helmert contrast: contr.helmert
• sum contrast: contr.sum
• polynomial contrast: contr.poly

To specify the characteristics of each contrast like which is the default comparison level in the treatment contrast see
the help for the contrast of interest.
Recall here also the function relevel.
If one wants different contrasts than the default ones, there are two ways to change this. First, we can change it globally,
so that it affects all applications where we need contrasts. For this we use the options command and specify there the
default contrasts for unordered and ordered factors. E.g.:

> options(contrasts=c('contr.sum', 'contr.helmert'))

Or we change it only in our regression function call. Here we can even use several different contrasts. If we call for
example the regression function lm and we have two factors, named factor1 (with treatment contrast) and factor2
(helmert contrast), we could use:

> lm.out <- lm(...,


+ contrasts = c(factor1 = contr.treatment,
+ factor2 = contr.helmert))

Fitted values

• There is a generic function to extract fitted values from a regression object. That function is called fitted.
• However especially for lm objects there are also two other ways to extract fitted values. Let us call our lm
object again lm.out. Then we can get the fitted values using:

– fitted(lm.out)
– fitted.values(lm.out)
– lm.out$fitted

Residuals in R

• As mentioned above, residuals are important features for model diagnostics.


• R offers a lot of residual types which can be extracted from a regression object. The basic function for this
purpose is residuals.
• Which types of residuals can be obtained from lm objects can be found out using the help for residuals.lm.
• For the studentized residuals there is also a special function, rstudent, and correspondingly the function
rstandard for standardized residuals.
• Furthermore, the basic residuals can also just be extracted using:

> lm.out$res

Predictions in R I

There can be several motivations for fitting a regression model. One is to predict the dependent variable for new
subjects or to predict the development in the future.
It is quite easy to get predictions in R. One needs mainly two steps to get them. First one has to create a data frame
(data.new) that contains the settings of the independent variables for which a prediction is wanted. Then one uses
the function predict to obtain the predictions.
Assume one wants to predict for the lm.out object and one has a data frame data.new for which one wants to predict.
Then use:

> predict(lm.out, data.new)

When we are also interested in confidence intervals we can add the interval argument.
We can specify if we want the real prediction interval (which takes also the variation of the errors into account):

> predict(lm.out, data.new,


+ interval = "prediction")

or the confidence interval i.e., the interval for the expected value of the response

> predict(lm.out, data.new,


+ interval = "confidence")

Influence diagnostics in R

• For a regression object lm.out, the function influence.measures(lm.out) will return a data frame containing
all important influence measures such as:

– DFBETAS: measures the difference in each parameter estimate with and without the influential point.
– DFFITS: scaled difference between the ith fitted value ŷi obtained from the full data and the ith predicted
value ŷ(i) obtained by deleting the ith observation.
– Cook's distance: Di = ri² hi / ((p + 1) σ̂² (1 − hi)²)
– covariance ratios: det(σ̂²(i) (X⊤(i) X(i))⁻¹) / det(σ̂² (X⊤X)⁻¹)
– leverage values for each observation (column hat).

• Observations assumed to be influential concerning any of the diagnostics are marked with an asterisk.
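
A sketch of how these diagnostics are obtained, where lm.out stands for a fitted lm object:

> infl <- influence.measures(lm.out)
> summary(infl)            # prints only the observations flagged as influential
> cooks.distance(lm.out)   # Cook's distances separately
> hatvalues(lm.out)        # leverages separately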

Model selection in R

• Automatic model selection is also possible in R. However not based on p-values but on AIC or BIC. The
function for this is the function step.
• It can perform all three different types of selections: backward, forward and stepwise.
• One can even specify minimal and maximal models between which we want to choose. In general one can
punish here the number of parameters with any weight k. But only the settings k = 2 (AIC) or k = log(n)
(BIC) have then a theoretical foundation.
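
A sketch of automatic selection with step(), where fit.full stands for a model containing all candidate predictors:

> step(fit.full, direction = "backward")                        # backward selection based on AIC (k = 2)
> step(fit.full, direction = "both", k = log(nobs(fit.full)))   # stepwise selection based on BIC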

Examples
Cherry Tree Example I

As a first example consider the trees data set (it actually ships with base R's datasets package, hence the warning
below when requesting it from MASS). The data set contains the girth, height and volume of 31 felled black cherry
trees. The aim is to obtain a model which can be used to predict the volume of a tree based on its height and girth.

> data("trees", package = "MASS")


Warning in data("trees", package = "MASS"): data set 'trees' not found
> head(trees)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7

Cherry Tree Example II

> str(trees)
'data.frame': 31 obs. of 3 variables:
$ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
$ Height: num 70 65 63 72 81 83 66 75 80 75 ...
$ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
> summary(trees)
Girth Height Volume
Min. : 8.3 Min. :63 Min. :10.2
1st Qu.:11.1 1st Qu.:72 1st Qu.:19.4
Median :12.9 Median :76 Median :24.2
Mean :13.2 Mean :76 Mean :30.2
3rd Qu.:15.2 3rd Qu.:80 3rd Qu.:37.3
Max. :20.6 Max. :87 Max. :77.0

Cherry Tree Example III

> plot(trees)

[Figure: scatterplot matrix (pairs plot) of Girth, Height and Volume for the trees data.]

Cherry Tree Example IV

Let us first fit a marginal model for the two explaining variables.

> options(show.signif.stars=FALSE)
> fit.girth <- lm(Volume ~ Girth, data = trees)
> summary(fit.girth)

Call:
lm(formula = Volume ~ Girth, data = trees)

Residuals:
Min 1Q Median 3Q Max
-8.065 -3.107 0.152 3.495 9.587

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.943 3.365 -11.0 7.6e-12
Girth 5.066 0.247 20.5 < 2e-16

Residual standard error: 4.25 on 29 degrees of freedom


Multiple R-squared: 0.935, Adjusted R-squared: 0.933
F-statistic: 419 on 1 and 29 DF, p-value: <2e-16

Cherry Tree Example V

> fit.height <- lm(Volume ~ Height, data = trees)


> summary(fit.height)

Call:

lm(formula = Volume ~ Height, data = trees)

Residuals:
Min 1Q Median 3Q Max
-21.27 -9.89 -2.89 12.07 29.85

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.124 29.273 -2.98 0.00583
Height 1.543 0.384 4.02 0.00038

Residual standard error: 13.4 on 29 degrees of freedom


Multiple R-squared: 0.358, Adjusted R-squared: 0.336
F-statistic: 16.2 on 1 and 29 DF, p-value: 0.000378

Cherry Tree Example VI

Model containing both explaining variables.

> fit.both <- lm(Volume ~ Girth + Height, data = trees)


> summary(fit.both)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
Min 1Q Median 3Q Max
-6.406 -2.649 -0.288 2.200 8.485

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.988 8.638 -6.71 2.7e-07
Girth 4.708 0.264 17.82 < 2e-16
Height 0.339 0.130 2.61 0.014

Residual standard error: 3.88 on 28 degrees of freedom


Multiple R-squared: 0.948, Adjusted R-squared: 0.944
F-statistic: 255 on 2 and 28 DF, p-value: <2e-16

Cherry Tree Example VII

> coef(fit.both)
(Intercept) Girth Height
-57.9877 4.7082 0.3393

> confint(fit.both)
2.5 % 97.5 %
(Intercept) -75.68226 -40.2931
Girth 4.16684 5.2495
Height 0.07265 0.6059

Cherry Tree Example VIII

A model with a second degree polynomial for both variables.

> fit.full <- lm(Volume ~ Girth + I(Girthˆ2) + Height + I(Heightˆ2),
+ data = trees)
> summary(fit.full)

Call:
lm(formula = Volume ~ Girth + I(Girthˆ2) + Height + I(Heightˆ2),
data = trees)

Residuals:
Min 1Q Median 3Q Max
-4.368 -1.670 -0.158 1.792 4.358

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.95510 63.01363 -0.02 0.988
Girth -2.79657 1.46868 -1.90 0.068
I(Girthˆ2) 0.26545 0.05169 5.14 2.4e-05
Height 0.11937 1.78459 0.07 0.947
I(Heightˆ2) 0.00172 0.01190 0.14 0.886

Residual standard error: 2.67 on 26 degrees of freedom


Multiple R-squared: 0.977, Adjusted R-squared: 0.974
F-statistic: 277 on 4 and 26 DF, p-value: <2e-16

Cherry Tree Example IX

> coef(fit.full)
(Intercept) Girth I(Girthˆ2) Height I(Heightˆ2)
-0.955101 -2.796569 0.265446 0.119372 0.001717

> confint(fit.full)
2.5 % 97.5 %
(Intercept) -130.48147 128.57127
Girth -5.81548 0.22234
I(Girthˆ2) 0.15920 0.37169
Height -3.54890 3.78765
I(Heightˆ2) -0.02275 0.02619

> anova(fit.full)
Analysis of Variance Table

Response: Volume
Df Sum Sq Mean Sq F value Pr(>F)
Girth 1 7582 7582 1060.60 < 2e-16
I(Girthˆ2) 1 213 213 29.78 1e-05
Height 1 125 125 17.54 0.00029
I(Heightˆ2) 1 0 0 0.02 0.88645
Residuals 26 186 7

Cherry Tree Example X

The polynomials make the parameters difficult to interpret.


Also the squared variables are highly correlated to the original variables.

> with(trees, cor(Girth, Girthˆ2))
[1] 0.993
> with(trees, cor(Height, Heightˆ2))
[1] 0.9989

The high correlation can be avoided by centering:

> m.Girth <- with(trees, mean(Girth))


> m.Height <- with(trees, mean(Height))
> with(trees, cor(Girth-m.Girth, (Girth-m.Girth)ˆ2))
[1] 0.438
> with(trees, cor(Height-m.Height, (Height-m.Height)ˆ2))
[1] -0.3134

Cherry Tree Example XI

So lets use the centered variables.

> fit.full.c <- lm(Volume ~ I(Girth - m.Girth) + I((Girth - m.Girth)ˆ2) +


+ I(Height - m.Height) + I((Height - m.Height)ˆ2), data = trees)
> summary(fit.full.c)

Call:
lm(formula = Volume ~ I(Girth - m.Girth) + I((Girth - m.Girth)ˆ2) +
I(Height - m.Height) + I((Height - m.Height)ˆ2), data = trees)

Residuals:
Min 1Q Median 3Q Max
-4.368 -1.670 -0.158 1.792 4.358

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.57375 0.70403 39.17 < 2e-16
I(Girth - m.Girth) 4.23689 0.20222 20.95 < 2e-16
I((Girth - m.Girth)ˆ2) 0.26545 0.05169 5.14 2.4e-05
I(Height - m.Height) 0.38031 0.09390 4.05 0.00041
I((Height - m.Height)ˆ2) 0.00172 0.01190 0.14 0.88645

Residual standard error: 2.67 on 26 degrees of freedom


Multiple R-squared: 0.977, Adjusted R-squared: 0.974
F-statistic: 277 on 4 and 26 DF, p-value: <2e-16

Cherry Tree Example XII

Let's eliminate the squared term for Height as it is not significant:

> fit.2 <- lm(Volume ~ I(Girth-m.Girth)+ I((Girth-m.Girth)ˆ2)


+ + I(Height-m.Height), data = trees)
> summary(fit.2)

Call:
lm(formula = Volume ~ I(Girth - m.Girth) + I((Girth - m.Girth)ˆ2) +
I(Height - m.Height), data = trees)

Residuals:

Min 1Q Median 3Q Max
-4.293 -1.669 -0.102 1.785 4.349

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.6109 0.6431 42.93 < 2e-16
I(Girth - m.Girth) 4.2325 0.1963 21.56 < 2e-16
I((Girth - m.Girth)ˆ2) 0.2686 0.0459 5.85 3.1e-06
I(Height - m.Height) 0.3764 0.0882 4.27 0.00022

Residual standard error: 2.62 on 27 degrees of freedom


Multiple R-squared: 0.977, Adjusted R-squared: 0.975
F-statistic: 383 on 3 and 27 DF, p-value: <2e-16

Cherry Tree Example XIII

A model comparison to check whether the dropped term is needed:

> anova(fit.full.c, fit.2)


Analysis of Variance Table

Model 1: Volume ~ I(Girth - m.Girth) + I((Girth - m.Girth)ˆ2) + I(Height -


m.Height) + I((Height - m.Height)ˆ2)
Model 2: Volume ~ I(Girth - m.Girth) + I((Girth - m.Girth)ˆ2) + I(Height -
m.Height)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 26 186
2 27 186 -1 -0.149 0.02 0.89

> op <- par(mfrow = c(2, 2))


> plot(fit.2)
> par(op)

Cherry Tree Example XIV

[Figure: the four default lm diagnostic plots for fit.2 (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours); observations 17, 18, 26 and 30 appear as labeled points.]

Cherry Tree Example XV

The model in this form might not be fully satisfactory.


However, as a check, assume the tree were a perfect cylinder with height h and radius r; then

• the girth of the circle is g = 2πr


• Note: the help page of the data set explains that the variable Girth actually measures diameter d = 2r rather
than girth.
• the volume of the cylinder is v = πr²h
• a multiplicative relationship might be more appropriate

Cherry Tree Example XVI

We can therefore estimate a regression on the log scale:

> fit.log <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)


> summary(fit.log)

Call:
lm(formula = log(Volume) ~ log(Girth) + log(Height), data = trees)

Residuals:
Min 1Q Median 3Q Max
-0.16856 -0.04849 0.00243 0.06364 0.12922

Coefficients:

Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.632 0.800 -8.29 5.1e-09
log(Girth) 1.983 0.075 26.43 < 2e-16
log(Height) 1.117 0.204 5.46 7.8e-06

Residual standard error: 0.0814 on 28 degrees of freedom


Multiple R-squared: 0.978, Adjusted R-squared: 0.976
F-statistic: 613 on 2 and 28 DF, p-value: <2e-16

Cherry Tree Example XVII

The diagnostic plots look better than before:

[Figure: the four default lm diagnostic plots for fit.log (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours); observations 11, 15, 16, 17 and 18 appear as labeled points.]

Anorexia Example

Next we will use the anorexia data which is also in the MASS package.
The data set has three variables:

• Treat
Type of psychotherapy. Factor of three levels Cont, CBT and FT. Cont should be the reference group.
• Prewt
Weight of the subject before the treatment in lbs.
• Postwt
Weight of the subject after the treatment in lbs.

Of interest is now, if the treatments have different effects on the weight of the subjects.

Anorexia Example I

This data set contains the effect of different forms of therapy on the body weight of subjects suffering from anorexia.

> data("anorexia", package = "MASS")


> str(anorexia)
'data.frame': 72 obs. of 3 variables:
$ Treat : Factor w/ 3 levels "CBT","Cont","FT": 2 2 2 2 2 2 2 2 2 2 ...
$ Prewt : num 80.7 89.4 91.8 74 78.1 88.3 87.3 75.1 80.6 78.4 ...
$ Postwt: num 80.2 80.1 86.4 86.3 76.1 78.1 75.1 86.7 73.5 84.6 ...

We can have a look at every twelveth observation in the data

> id <- seq(12, nrow(anorexia), by = 12)


> anorexia[id,]
Treat Prewt Postwt
12 Cont 88.7 79.5
24 Cont 77.5 81.2
36 CBT 80.5 82.1
48 CBT 76.5 75.7
60 FT 86.7 100.3
72 FT 87.3 98.0

Anorexia Example II

> summary(anorexia)
Treat Prewt Postwt
CBT :29 Min. :70.0 Min. : 71.3
Cont:26 1st Qu.:79.6 1st Qu.: 79.3
FT :17 Median :82.3 Median : 84.0
Mean :82.4 Mean : 85.2
3rd Qu.:86.0 3rd Qu.: 91.5
Max. :94.9 Max. :103.6

> anorexia$TREAT <- relevel(anorexia$Treat, ref = "Cont")

> boxplot(Prewt ~ TREAT, data = anorexia, ylab = "preweight")


> boxplot(Postwt ~ TREAT, data = anorexia, ylab = "postweight")

Anorexia Example III

[Figure: boxplot of preweight (Prewt) by TREAT (Cont, CBT, FT).]

Anorexia Example IV
[Figure: boxplot of postweight (Postwt) by TREAT (Cont, CBT, FT).]

Anorexia Example V

This shows how pipes (Chapter 3) can be used for summarizing data frames:

> anorexia |>
+ subset(select = Prewt:TREAT) |>
+ with(aggregate(cbind(Prewt, Postwt),
+ data.frame(TREAT),
+ function(x) c(mean=mean(x), sd = sd(x)))) |>
+ cbind(n.group = with(anorexia, tapply(Prewt, TREAT, length)))
TREAT Prewt.mean Prewt.sd Postwt.mean Postwt.sd n.group
Cont Cont 81.558 5.707 81.108 4.744 26
CBT CBT 82.690 4.845 85.697 8.352 29
FT FT 83.229 5.017 90.494 8.475 17

Anorexia Example VI

We fit a linear model with TREAT as the explanatory variable. Note that the treatment contrasts are used by default.

> anfit1 <- lm(Postwt ~ TREAT, data = anorexia) # includes intercept


>
> model.matrix(anfit1)[id, ]
(Intercept) TREATCBT TREATFT
12 1 0 0
24 1 0 0
36 1 1 0
48 1 1 0
60 1 0 1
72 1 0 1

Anorexia Example VII

> summary(anfit1)

Call:
lm(formula = Postwt ~ TREAT, data = anorexia)

Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 81.11 1.43 56.75 <2e-16
TREATCBT 4.59 1.97 2.33 0.0227
TREATFT 9.39 2.27 4.13 0.0001

Residual standard error: 7.29 on 69 degrees of freedom


Multiple R-squared: 0.2, Adjusted R-squared: 0.177
F-statistic: 8.65 on 2 and 69 DF, p-value: 0.000444

The intercept coefficient gives the average weight (post treatment) for the Cont control (i.e., reference) group; the
TREATCBT coef shows that patients in the CBT group have on average 4.589 lbs more than the reference group; the
TREATFT coef shows that patients in the FT group have on average 9.386 lbs more than the reference group.

Anorexia Example VIII

We can also fit a model without intercept.

> anfit1b <- lm(Postwt ~ TREAT - 1, data = anorexia)
> model.matrix(anfit1b)[id,]
TREATCont TREATCBT TREATFT
12 1 0 0
24 1 0 0
36 0 1 0
48 0 1 0
60 0 0 1
72 0 0 1

Anorexia Example IX

The coefficients now represent the average weight post treatment in each category.

> summary(anfit1b)

Call:
lm(formula = Postwt ~ TREAT - 1, data = anorexia)

Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903

Coefficients:
Estimate Std. Error t value Pr(>|t|)
TREATCont 81.11 1.43 56.8 <2e-16
TREATCBT 85.70 1.35 63.3 <2e-16
TREATFT 90.49 1.77 51.2 <2e-16

Residual standard error: 7.29 on 69 degrees of freedom


Multiple R-squared: 0.993, Adjusted R-squared: 0.993
F-statistic: 3.28e+03 on 3 and 69 DF, p-value: <2e-16

Note: 1. The hypothesis tests are not so informative as one is typically interested in whether the differences among
the groups are significant.

2. R2 is totally off. Should not be used in models without intercept.

Anorexia Example X

Let’s have a look at using the sum contrasts:

> anfit1c <- lm(Postwt ~ TREAT, data = anorexia,


+ contrast = list(TREAT = "contr.sum"))
> model.matrix(anfit1c)[id,]
(Intercept) TREAT1 TREAT2
12 1 1 0
24 1 1 0
36 1 0 1
48 1 0 1
60 1 -1 -1
72 1 -1 -1

Anorexia Example XI

> summary(anfit1c)

Call:
lm(formula = Postwt ~ TREAT, data = anorexia, contrasts = list(TREAT = "contr.sum"))

Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.7661 0.8819 97.26 < 2e-16
TREAT1 -4.6584 1.2078 -3.86 0.00025
TREAT2 -0.0696 1.1782 -0.06 0.95309

Residual standard error: 7.29 on 69 degrees of freedom


Multiple R-squared: 0.2, Adjusted R-squared: 0.177
F-statistic: 8.65 on 2 and 69 DF, p-value: 0.000444

• (Intercept) (β0 ) - mean of the mean weight in each group (a bit weird . . . if the data were balanced, it
would be the mean weight in the whole dataset).
• TREAT1 (β1 ) - deviation of the average weight for Cont from the intercept.
• TREAT2 (β2 ) - deviation of the average weight for CBT from the intercept.
• We don't have a coefficient for FT as its deviation is by construction −β1 − β2 (see the quick check below).
• Note: The F-statistic and R2 do not change, we only transform the coefficients.
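
As a quick check (a sketch reusing the group means from the model without intercept above), the sum-contrast coefficients can be reproduced by hand: the intercept is the mean of the three group means, TREAT1 and TREAT2 are the deviations of Cont and CBT from it, and the implied deviation for FT is minus their sum.

> ## Sketch: reproduce the sum-contrast coefficients from the group means
> grp.means <- with(anorexia, tapply(Postwt, TREAT, mean))
> mean(grp.means)                                       # intercept
> grp.means[c("Cont", "CBT")] - mean(grp.means)         # TREAT1 and TREAT2
> -sum(grp.means[c("Cont", "CBT")] - mean(grp.means))   # implied deviation for FT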

Anorexia Example XII

Let’s have a look at using the helmert contrasts:

> anfit1d <- lm(Postwt ~ TREAT, data = anorexia,


+ contrast = list(TREAT = "contr.helmert"))
> model.matrix(anfit1d)[id,]
(Intercept) TREAT1 TREAT2
12 1 -1 -1
24 1 -1 -1
36 1 1 -1
48 1 1 -1
60 1 0 2
72 1 0 2

Anorexia Example XIII

> summary(anfit1d)

Call:
lm(formula = Postwt ~ TREAT, data = anorexia, contrasts = list(TREAT = "contr.helmert"))

Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903

Coefficients:
Estimate Std. Error t value Pr(>|t|)

(Intercept) 85.766 0.882 97.26 < 2e-16
TREAT1 2.294 0.984 2.33 0.02267
TREAT2 2.364 0.674 3.51 0.00081

Residual standard error: 7.29 on 69 degrees of freedom


Multiple R-squared: 0.2, Adjusted R-squared: 0.177
F-statistic: 8.65 on 2 and 69 DF, p-value: 0.000444

• (Intercept) (β0 ) - mean of the mean weight in each group (still weird . . . ).
• TREAT1(β1 ) - the average value of the means in
Cont and CBT is 2.29 lbs higher than the mean of Cont.
• TREAT2(β2 ) - the average value of the means in
Cont, CBT and FT is 2.364 lbs higher than the average value of the means in Cont and CBT.
• Note: Not the most intuitive . . .

Anorexia Example XIV

Let’s have a look now at a scatterplot of Prewt and Postwt and color the points by TREAT

> op <- par(mfrow = c(2, 2))


> cols <- c("lightblue", "orange", "darkgreen")
> plot(Postwt ~ Prewt, data = anorexia, col = cols[as.numeric(TREAT)],
+ pch = as.numeric(TREAT) + 14,
+ main = "All groups")
> legend("topleft", pch = 1:3 + 15, col = cols,
+ legend = levels(anorexia$TREAT))
> plot(Postwt ~ Prewt, data = anorexia, col = cols[as.numeric(TREAT)],
+ pch = as.numeric(TREAT) + 14,
+ main = "Cont", subset = TREAT == "Cont")
> plot(Postwt ~ Prewt, data = anorexia, col = cols[as.numeric(TREAT)],
+ pch = as.numeric(TREAT) + 14,
+ main="CBT", subset = TREAT == "CBT")
> plot(Postwt ~ Prewt, data = anorexia, col = cols[as.numeric(TREAT)],
+ pch = as.numeric(TREAT) + 14,
+ main = "FT", subset = TREAT=="FT")
> par(op)

Anorexia Example XV

[Figure: scatterplots of Postwt vs Prewt, once for all groups together and once separately for each TREAT group.]

The relationship seems different for the different TREAT groups.

Anorexia Example XVI

The following will fit a regression with the same slope but different intercepts for the different TREAT groups.

> anfit2 <- lm(Postwt ~ TREAT + Prewt, data = anorexia)


> coef(anfit2)
(Intercept) TREATCBT TREATFT Prewt
45.6740 4.0971 8.6601 0.4345

We can plot the regression lines for each class by first calculating the main effects (i.e., separate intercepts) from the
coefficients:

> plot(Postwt ~ Prewt, data = anorexia, col = cols[as.numeric(TREAT)],


+ pch = as.numeric(TREAT) + 14)
> legend("topleft", pch = 1:3 + 14, col = cols,
+ legend = levels(anorexia$TREAT))
> abline(coef(anfit2)[1], coef(anfit2)[4], col = cols[1])
> abline(coef(anfit2)[1] + coef(anfit2)[2], coef(anfit2)[4],
+ col = cols[2])
> abline(coef(anfit2)[1] + coef(anfit2)[3], coef(anfit2)[4],
+ col = cols[3])
> with(anorexia, points(Prewt, fitted(anfit2),
+ pch = "x", # represent fitted values
+ col = cols[as.numeric(TREAT)]))

Anorexia Example XVII

[Figure: Postwt vs Prewt with the fitted parallel regression lines (common slope, different intercepts per TREAT group); fitted values are marked by x.]

Anorexia Example XVIII

The following will fit a regression with different slopes and different intercepts for the different TREAT groups.

> anfit3 <- lm(Postwt ~ TREAT * Prewt, data = anorexia)


> summary(anfit3)

Call:
lm(formula = Postwt ~ TREAT * Prewt, data = anorexia)

Residuals:
Min 1Q Median 3Q Max
-12.812 -3.850 -0.915 4.001 15.964

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.051 18.809 4.89 6.7e-06
TREATCBT -76.474 28.347 -2.70 0.0089
TREATFT -77.232 33.133 -2.33 0.0228
Prewt -0.134 0.230 -0.58 0.5617
TREATCBT:Prewt 0.982 0.344 2.85 0.0058
TREATFT:Prewt 1.043 0.400 2.61 0.0112

Residual standard error: 6.57 on 66 degrees of freedom


Multiple R-squared: 0.379, Adjusted R-squared: 0.332
F-statistic: 8.07 on 5 and 66 DF, p-value: 5.5e-06

Anorexia Example XIX

> model.matrix(anfit3)[id,]
(Intercept) TREATCBT TREATFT Prewt TREATCBT:Prewt TREATFT:Prewt
12 1 0 0 88.7 0.0 0.0
24 1 0 0 77.5 0.0 0.0
36 1 1 0 80.5 80.5 0.0
48 1 1 0 76.5 76.5 0.0
60 1 0 1 86.7 0.0 86.7
72 1 0 1 87.3 0.0 87.3

Anorexia Example XX

> summary(anfit3)

Call:
lm(formula = Postwt ~ TREAT * Prewt, data = anorexia)

Residuals:
Min 1Q Median 3Q Max
-12.812 -3.850 -0.915 4.001 15.964

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.051 18.809 4.89 6.7e-06
TREATCBT -76.474 28.347 -2.70 0.0089
TREATFT -77.232 33.133 -2.33 0.0228
Prewt -0.134 0.230 -0.58 0.5617
TREATCBT:Prewt 0.982 0.344 2.85 0.0058
TREATFT:Prewt 1.043 0.400 2.61 0.0112

Residual standard error: 6.57 on 66 degrees of freedom


Multiple R-squared: 0.379, Adjusted R-squared: 0.332
F-statistic: 8.07 on 5 and 66 DF, p-value: 5.5e-06

Note:

• Coefficient of Prewt corresponds to slope for reference category Cont.


• Coefficient of TREATCBT:Prewt gives the difference in slope for category CBT and slope for reference category.
• Coefficient of TREATFT:Prewt gives the difference in slope for category FT and slope for reference category.

Anorexia Example XXI

From the coefficients we can compute the intercepts and slopes of the different regression lines:

> coef(anfit3)
(Intercept) TREATCBT TREATFT Prewt TREATCBT:Prewt
92.0515 -76.4742 -77.2317 -0.1342 0.9822
TREATFT:Prewt
1.0434

> plot(Postwt ~ Prewt, data = anorexia,


+ col=cols[as.numeric(TREAT)],
+ pch = as.numeric(TREAT) + 14)
> legend("topleft", pch = 1:3 + 14, col = cols,
+ legend = levels(anorexia$TREAT))

> abline(coef(anfit3)[1],coef(anfit3)[4], col=cols[1])
> abline(coef(anfit3)[1]+coef(anfit3)[2],coef(anfit3)[4]+
+ coef(anfit3)[5], col=cols[2])
> abline(coef(anfit3)[1]+coef(anfit3)[3],coef(anfit3)[4]+
+ coef(anfit3)[6], col=cols[3])
> with(anorexia, points(Prewt,fitted(anfit3),
+ pch = "x",
+ col = cols[as.numeric(TREAT)]))

Anorexia Example XXII

[Figure: Postwt vs Prewt with the fitted regression lines (different slopes and intercepts per TREAT group); fitted values are marked by x.]

Anorexia Example XXIII

To make the intercept more interpretable we could subtract the minimum value from Prewt:

> min(anorexia$Prewt)
[1] 70
> anorexia$Prewt2 <-
+ anorexia$Prewt - min(anorexia$Prewt)

We refit the model:

> anfit4 <- lm(Postwt ~ TREAT * Prewt2, data=anorexia)

Anorexia Example XXIV

> summary(anfit4)

Call:
lm(formula = Postwt ~ TREAT * Prewt2, data = anorexia)

Residuals:
Min 1Q Median 3Q Max
-12.812 -3.850 -0.915 4.001 15.964

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.659 2.954 27.98 <2e-16
TREATCBT -7.723 4.558 -1.69 0.0949
TREATFT -4.193 5.477 -0.77 0.4467
Prewt2 -0.134 0.230 -0.58 0.5617
TREATCBT:Prewt2 0.982 0.344 2.85 0.0058
TREATFT:Prewt2 1.043 0.400 2.61 0.0112

Residual standard error: 6.57 on 66 degrees of freedom


Multiple R-squared: 0.379, Adjusted R-squared: 0.332
F-statistic: 8.07 on 5 and 66 DF, p-value: 5.5e-06

Anorexia Example XXV

> op <- par(mfrow = c(2,2)); plot(anfit4); par(op)


[Figure: diagnostic plots for anfit4 (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 34, 41 and 64 are flagged.]

Anorexia Example XXVI: Prediction

How would the weight of a patient with a weight of 90lbs before the study change post-study depending on the
treatment?

> new.data <- data.frame(Prewt = c(90, 90, 90),
+ Prewt2 = c(20, 20, 20),
+ TREAT = factor(c("Cont","CBT","FT"),
+ levels = c("Cont","CBT","FT")))
> new.data
Prewt Prewt2 TREAT
1 90 20 Cont
2 90 20 CBT
3 90 20 FT

> predict(anfit4, new.data, interval="predict")


fit lwr upr
1 79.97 66.07 93.88
2 91.90 78.05 105.74
3 96.65 82.46 110.84
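
The point predictions can also be reproduced by hand (a sketch): build the design matrix for the new data with the same formula and multiply it with the fitted coefficients. The factor levels of TREAT in new.data were defined above so that the default treatment contrasts match those used in anfit4.

> ## Sketch: point predictions by hand via the design matrix
> X.new <- model.matrix(~ TREAT * Prewt2, data = new.data)
> drop(X.new %*% coef(anfit4))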

Scottish Hills Example

As a short final example we use the Scottish Hills data, which gives the record times in 1984 for 35 Scottish hill races. The variables are:

• dist distance in miles (on the map).


• climb total height gained during the route, in feet.
• time record time in minutes.

> data("hills", package = "MASS")


> fit.lm <- lm(time ~ dist + climb, data = hills)

Scottish Hills Example I

A visualization of the data set:


[Figure: scatterplot matrix of dist, climb and time.]

Scottish Hills Example II

> summary(fit.lm)

Call:
lm(formula = time ~ dist + climb, data = hills)

Residuals:
Min 1Q Median 3Q Max
-16.22 -7.13 -1.19 2.37 65.12

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.99204 4.30273 -2.09 0.045
dist 6.21796 0.60115 10.34 9.9e-12
climb 0.01105 0.00205 5.39 6.4e-06

Residual standard error: 14.7 on 32 degrees of freedom


Multiple R-squared: 0.919, Adjusted R-squared: 0.914
F-statistic: 182 on 2 and 32 DF, p-value: <2e-16

Scottish Hills Example III

> op <- par(mfrow = c(2, 2)); plot(fit.lm); par(op)


[Figure: diagnostic plots for fit.lm (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); Knock Hill, Bens of Jura, Ben Nevis and Lairig Ghru stand out.]

Scottish Hills Example IV

Knock Hill has a large residual and a Cook's distance close to 0.5.

> # removing Knock Hill
> fit.lm.wKH <- update(fit.lm, subset = -18)
> coef(fit.lm.wKH)
(Intercept) dist climb
-13.53035 6.36456 0.01185

The Bens of Jura race was identified as an influential point, given it has a Cook's distance above one.

> # Removing also BensJura


> fit.lm.wKH.J <- update(fit.lm, subset = -c(7,18))
> coef(fit.lm.wKH.J)
(Intercept) dist climb
-10.361646 6.692114 0.008047

Change in coefficients is observable.

Scottish Hills Example V

> op <- par(mfrow = c(2, 2)); plot(fit.lm.wKH.J); par(op)

[Figure: diagnostic plots for fit.lm.wKH.J; Two Breweries, Cairn Table, Black Hill, Lairig Ghru and Ben Nevis are flagged.]

Scottish Hills Example VI

We can also weight observations in the regression model (by default all observations contribute equally to the estimation of the coefficients)

> # weights 1/distˆ2 - long distance races get less weight
> fit.lm2 <- lm(time ~ dist + climb, weight = 1 / distˆ2, data = hills)
> summary(fit.lm2)

Call:
lm(formula = time ~ dist + climb, data = hills, weights = 1/distˆ2)

Weighted Residuals:
Min 1Q Median 3Q Max
-3.728 -1.521 -0.513 0.324 18.620

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.62715 6.26766 0.58 0.5668
dist 5.93960 1.71496 3.46 0.0015
climb 0.00384 0.00482 0.80 0.4321

Residual standard error: 3.7 on 32 degrees of freedom


Multiple R-squared: 0.458, Adjusted R-squared: 0.425
F-statistic: 13.5 on 2 and 32 DF, p-value: 5.48e-05

Scottish Hills Example VII

> str(influence.measures(fit.lm2))
List of 3
$ infmat: num [1:35, 1:7] -0.22373 0.00126 0.01023 0.01639 0.00281 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:35] "Greenmantle" "Carnethy" "Craig Dunain" "Ben Rha" ...
.. ..$ : chr [1:7] "dfb.1_" "dfb.dist" "dfb.clmb" "dffit" ...
$ is.inf: logi [1:35, 1:7] FALSE FALSE FALSE FALSE FALSE FALSE ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:35] "Greenmantle" "Carnethy" "Craig Dunain" "Ben Rha" ...
.. ..$ : chr [1:7] "dfb.1_" "dfb.dist" "dfb.clmb" "dffit" ...
$ call : language lm(formula = time ~ dist + climb, data = hills, weights = 1/distˆ2)
- attr(*, "class")= chr "infl"

Scottish Hills Example VIII

For the weighted regression Knock Hill is influential (it has a low value for distance, so its influence gets higher due to the weight). Also Cow Hill has a small distance.

> summary(influence.measures(fit.lm2))
Potentially influential observations of
lm(formula = time ~ dist + climb, data = hills, weights = 1/distˆ2) :

dfb.1_ dfb.dist dfb.clmb dffit cov.r cook.d hat


Knock Hill 4.82_* 0.22 -3.44_* 6.51_* 0.00_* 1.38_* 0.12
Cow Hill -0.11 0.09 -0.05 -0.13 1.71_* 0.01 0.36_*

Scottish Hills Example IX

> op <- par(mfrow = c(2, 2)); plot(fit.lm2); par(op)

[Figure: diagnostic plots for fit.lm2; Knock Hill, Bens of Jura, Two Breweries, Black Hill and Creag Dubh are flagged.]

Computational linear algebra in R


• Working with matrices, which means linear algebra, is nowadays essential for a statistician (e.g., linear regression, simulation, smoothing etc.).
• R has vectors and matrices as data types and it also contains many tools to perform multivariate calculus.
• From a computational point of view, many problems in linear algebra boil down to solving systems of linear
equations accurately and efficiently.

– In order to assess accuracy we need to understand properties of matrices.


– In order to efficiently compute the solution, we sometimes need to formulate problems differently than
we would from a strict mathematical point of view. E.g., to solve Ax = b the solution x = A−1 b is
computationally a no-go.

Vectors and matrices in R


• R normally makes no distinction between column or row vectors.
• It does allow matrices with one column or one row when the distinction is important.
• Matrices are stored internally as vectors with dimension attributes (that’s why they are homogeneous data
types . . . )

Recap
We have seen so far multiple operations on matrices:

• dimension of a matrix can be obtained using the function dim, ncol and nrow.

• subsetting is done using the function [ where in a linear algebra context often the argument drop = FALSE is important (if we need vectors to be row or column vectors)
• applying functions rowwise or columnwise: apply(x, 1, function(x) ...), apply(x, 2, function(x) ...)
• specialized functions for rowwise and columwise summaries: colMeans, rowMeans, colSums, rowSums (faster
than apply).
• standardizing matrix using scale.

Let’s see some more useful functions . . .

Rowwise or columnwise matrix manipulations I


• If rows or columns should be manipulated, the most convenient function is sweep.
• It has the structure below, where X is again the matrix of interest, direction is either 1 = rowwise or 2 =
columnwise, and function is the function to be applied. value is then the value that is “applied” to each
row / column using function.

sweep(X, direction, value, function, ...)

Rowwise or columnwise matrix manipulations II


For example centering a matrix can be done as

> M <- matrix(1:12, ncol = 3)


> colMeans(M)
[1] 2.5 6.5 10.5
> Y <- sweep(M, 2, colMeans(M), "-")
> Y
[,1] [,2] [,3]
[1,] -1.5 -1.5 -1.5
[2,] -0.5 -0.5 -0.5
[3,] 0.5 0.5 0.5
[4,] 1.5 1.5 1.5

Attention!! arithmetic operations are done by default columnwise using recycling. . .

> ## This would be very wrong..


> M - colMeans(M)
[,1] [,2] [,3]
[1,] -1.5 -1.5 -1.5
[2,] -4.5 -4.5 7.5
[3,] -7.5 4.5 4.5
[4,] 1.5 1.5 1.5

Matrix properties: diag


The function diag is handy when either wanting to extract the diagonal elements of a square matrix or to create
diagonal matrices.

> X <- matrix(runif(9), 3, 3)


> X
[,1] [,2] [,3]
[1,] 0.7570 0.18870 0.6224
[2,] 0.7758 0.75827 0.4159

[3,] 0.5736 0.09268 0.7765
> diag(X)
[1] 0.7570 0.7583 0.7765

> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
> diag(1:3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 2 0
[3,] 0 0 3

Matrix properties: transpose


To transpose a matrix in R the function to be used is t.

> X
[,1] [,2] [,3]
[1,] 0.7570 0.18870 0.6224
[2,] 0.7758 0.75827 0.4159
[3,] 0.5736 0.09268 0.7765
> t(X)
[,1] [,2] [,3]
[1,] 0.7570 0.7758 0.57359
[2,] 0.1887 0.7583 0.09268
[3,] 0.6224 0.4159 0.77651

Note that transposing matrices is computationally expensive!

Matrix properties: determinant and trace


The determinant of a square matrix can be found using the det() function:

> det(X)
[1] 0.122

R has no built-in trace function in base R, but one can easily be defined:

> trace <- function(x) sum(diag(x))


> trace(X)
[1] 2.292

Triangular matrices I
• Functions lower.tri() and upper.tri() can be used to obtain the lower and upper parts of matrices.
• The output of these functions is a matrix of logical values where TRUE marks the relevant triangular
elements.

> lower.tri(X)
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE TRUE FALSE

> lower.tri(X, diag = TRUE)


[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] TRUE TRUE FALSE
[3,] TRUE TRUE TRUE

Triangular matrices II
We can use these functions to e.g., replace all upper triangular elements by zero.

> Xnew <- X


> Xnew[upper.tri(Xnew)] <- 0
> Xnew
[,1] [,2] [,3]
[1,] 0.7570 0.00000 0.0000
[2,] 0.7758 0.75827 0.0000
[3,] 0.5736 0.09268 0.7765

Matrix arithmetic
• Multiplication of a matrix X by a scalar a is the same as the multiplication of a vector with a scalar.
• Elementwise addition, multiplication etc can be done with +, * etc. (the dimensions must match)

> Y <- 2 * M
> Y + M
[,1] [,2] [,3]
[1,] 3 15 27
[2,] 6 18 30
[3,] 9 21 33
[4,] 12 24 36

Matrix multiplication
• For standard matrix multiplication the function is %*%. Let x and y be vectors and X and Y matrices.
• If vectors are used in the multiplication, R tries to figure out if they should be row or column vectors.
• If y %*% x is used when both have the same length, the inner product will be returned as a 1 × 1 matrix.

> x <- 1:3


> y <- 5:7
> X <- matrix(1:9, 3, 3)
> Y <- matrix(10:21, 4, 3)
> x
[1] 1 2 3
> y
[1] 5 6 7

Matrix multiplication II

> X
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> Y
[,1] [,2] [,3]
[1,] 10 14 18
[2,] 11 15 19
[3,] 12 16 20
[4,] 13 17 21

Matrix multiplication III

> x %*% y
[,1]
[1,] 38
> x %*% X
[,1] [,2] [,3]
[1,] 14 32 50
> X %*% x
[,1]
[1,] 30
[2,] 36
[3,] 42

Matrix multiplication IV

> Y %*% X
[,1] [,2] [,3]
[1,] 92 218 344
[2,] 98 233 368
[3,] 104 248 392
[4,] 110 263 416

But note, also general matrix multiplication is expensive!

Functions for special matrix products.


• As mentioned general matrix products and taking transpose can be expensive operations.
• The general matrix product makes no use of information of the resulting object. E.g., if we know that the
result of a multiplication should be symmetric, we would ideally exploit this info in the computation.
• For products of the form X ⊤ X, XX ⊤ , X ⊤ Y and XY ⊤ , R therefore has the special functions crossprod and
tcrossprod.
– X ⊤ X: crossprod(X)
– XX ⊤ : tcrossprod(X)
– X ⊤Y : crossprod(X, Y)
– XY ⊤ : tcrossprod(X, Y)

These functions should be used whenever possible!
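
A small sketch (with made-up dimensions) illustrating that the specialized functions give the same results as the explicit products while avoiding the explicit transpose; for large matrices they are typically also faster.

> ## Sketch: crossprod/tcrossprod versus the explicit products
> XX <- matrix(rnorm(1000 * 100), ncol = 100)
> all.equal(crossprod(XX), t(XX) %*% XX)     # X'X
> all.equal(tcrossprod(XX), XX %*% t(XX))    # XX'
> system.time(crossprod(XX))                 # timing differences grow with size
> system.time(t(XX) %*% XX)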

Matrix inversion
To obtain the inverse of an invertible square matrix R has the function solve.

> X <- matrix(rnorm(9),3,3)


> Xinv <- solve(X)
> X %*% Xinv
[,1] [,2] [,3]
[1,] 1.000e+00 -4.274e-17 -4.107e-17
[2,] 6.173e-17 1.000e+00 9.353e-18
[3,] 2.146e-17 5.020e-17 1.000e+00

Computing the inverse is however computationally also expensive and there is hardly ever a good reason to
invert a matrix in statistical computations.

More on solve I
The function solve gives the inverse of a matrix actually only as a byproduct.
In general the purpose of the function is to solve systems of linear equations like

Ax = b ⇐⇒ x = A−1 b

where A is a square matrix and b can be either a vector or a matrix.


Then the call for solve would take the form solve(A, b) which is computationally much better than solve(A) %*%
b.
In R, solve() relies on the LU decomposition.
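
A minimal sketch (random numbers, only the pattern matters) comparing the two approaches; solving the system directly is both faster and numerically preferable to forming the inverse first.

> ## Sketch: solve(A, b) versus solve(A) %*% b
> set.seed(1)
> A <- matrix(rnorm(500 * 500), 500, 500)
> b <- rnorm(500)
> all.equal(solve(A, b), drop(solve(A) %*% b))
> system.time(solve(A, b))
> system.time(solve(A) %*% b)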

More on solve II
Assume A is an (n × n) matrix. A−1 is the solution to the matrix equation AA−1 = In . This can be seen as n
separate systems of linear equations in n unknowns, whose solution are the columns of the inverse.
It would be inefficient to solve first n systems of linear equations in order to obtain the inverse, for the purpose of
solving one, namely the original, system.
Also, finding the inverse necessitates a lot of calculations, which gives opportunities for many more rounding errors to distort our results.

Eigenvalue eigenvector decomposition


• In statistics one of the most important decompositions is the eigenvalue eigenvector decomposition

A = U DU ⊤ .

• The function for this in R is eigen.


• Computing the decomposition is easier when it is known that the matrix to decompose is symmetric. Furthermore, the most expensive part of the decomposition is the computation of the eigenvectors.
• Therefore the function has the arguments symmetric and only.values which should be specified whenever
possible.

Eigenvalue eigenvector decomposition II

> X <- matrix(rnorm(300), ncol = 3)
> covX <- cov(X)
> eigen(covX, symmetric = TRUE)
eigen() decomposition
$values
[1] 1.3165 1.1031 0.9205

$vectors
[,1] [,2] [,3]
[1,] 0.6517 -0.18573 0.73540
[2,] 0.1693 0.98072 0.09768
[3,] -0.7394 0.06084 0.67055
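
If only the eigenvalues are needed (for example to check positive definiteness), the expensive eigenvector computation can be skipped; a short sketch:

> ## Sketch: skip the eigenvector computation when only eigenvalues are needed
> eigen(covX, symmetric = TRUE, only.values = TRUE)$values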

Other useful linear algebra functions in R


Other useful function for linear algebra in R are:

• qr: QR-decomposition
• chol: Cholesky decomposition of a positive symmetric matrix
• svd: singular value decomposition
• outer : performs an operation on all possible pairs of elements of two vectors.
• kronecker: computes the Kronecker product.

But when working with matrices which have special properties, like sparse matrices, it is worth checking the Matrix package, which has classes for the different types of matrices and can then take advantage of that knowledge when, for example, computing decompositions or products.
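
Two tiny illustrations (a sketch) of outer() and kronecker(); the shorthand operators %o% and %x% do the same.

> ## Sketch: outer product and Kronecker product
> outer(1:3, 1:4)                       # 3 x 4 matrix with entries i * j
> outer(1:3, 1:4, FUN = "+")            # any function of the pairs can be used
> kronecker(diag(2), matrix(1, 2, 2))   # simple block structure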

Cholesky decomposition in R I
• If A is positive semidefinite, it possesses a square root B such that B^2 = A.
• The Cholesky decomposition is similar, but the idea is to find an upper triangular matrix U such that:

U ⊤ U = A.

• In R the chol() function accomplishes this task.

Cholesky decomposition in R II

> H3 <- 1/cbind(1:3, 2:4, 3:5)


> chol.H3 <- chol(H3)
> chol.H3
[,1] [,2] [,3]
[1,] 1 0.5000 0.33333
[2,] 0 0.2887 0.28868
[3,] 0 0.0000 0.07454

> ## This should be the same as H3


> crossprod(chol.H3)
[,1] [,2] [,3]
[1,] 1.0000 0.5000 0.3333
[2,] 0.5000 0.3333 0.2500
[3,] 0.3333 0.2500 0.2000

Cholesky decomposition in R III
• The Cholesky decomposition can be employed to find the inverse of A more efficiently:

A = U ⊤ U ⇒ A−1 = U −1 (U −1 )⊤

where computing U −1 can be done more easily given the triangular structure.
• A−1 can be obtained by chol2inv():

> chol2inv(chol.H3)
[,1] [,2] [,3]
[1,] 9 -36 30
[2,] -36 192 -180
[3,] 30 -180 180

Cholesky decomposition in R IV
• The Cholesky decomposition can be employed to solve linear systems of the form:

Ax = b ⇒ U ⊤ U x = b ⇒ U x = (U −1 )⊤ b.

• The solution can be obtained in a two step procedure:

1. Solve U ⊤ y = b. This will satisfy (U −1 )⊤ b = y.


2. Solve U x = y

• The first system is lower triangular so forward substitution can be used. The function forwardsolve() can be
used for this.
• The second system is upper triangular so back substitution can be used using function backsolve().

Cholesky decomposition in R V
For the problem H3 x = b where b = (1, 2, 3)⊤ we have:

> b <- 1:3


> y <- forwardsolve(t(chol.H3), b)
> ## solution for x
> backsolve(chol.H3, y)
[1] 27 -192 210

We could have also solved this using:

> solve(H3, b)
[1] 27 -192 210

QR decomposition in R I
• Another way of decomposing a matrix A is through the QR decomposition:

A = QR

where Q is an orthogonal matrix and R is an upper triangular matrix.


• This decomposition can also be applied when A is non square.
• The function in R is qr() and the output is an object of class qr.
• Functions qr.R() and qr.Q() can be applied to this object to obtain the Q and R matrices.

QR decomposition in R II
For more details on the output of qr() see the help page ?qr

> H3 <- 1/cbind(1:3, 2:4, 3:5)


> qr(H3)
$qr
[,1] [,2] [,3]
[1,] -1.1667 -0.6429 -0.450000
[2,] 0.4286 -0.1017 -0.105337
[3,] 0.2857 0.7293 0.003901

$rank
[1] 3

$qraux
[1] 1.857143 1.684241 0.003901

$pivot
[1] 1 2 3

attr(,"class")
[1] "qr"

QR decomposition in R III

> H3qr <- qr(H3)


> qr.R(H3qr)
[,1] [,2] [,3]
[1,] -1.167 -0.6429 -0.450000
[2,] 0.000 -0.1017 -0.105337
[3,] 0.000 0.0000 0.003901

> qr.Q(H3qr)
[,1] [,2] [,3]
[1,] -0.8571 0.5016 0.1170
[2,] -0.4286 -0.5685 -0.7022
[3,] -0.2857 -0.6521 0.7022

QR decomposition in R IV
• The QR decomposition can be used to obtain more accurate solutions to linear systems. If we want to solve
(here A is an (n × n) matrix):
Ax = b ⇒ QRx = b ⇒ Rx = Q⊤ b
• Here Q⊤ b can be easily calculated. Then the system can be easily solved using backsubstitution as R is an
upper triangular matrix.
• Function qr.solve(A, b) can be used to solve the above system.
• If the system is over-determined, one can find the solution x which minimizes the distance between b and Ax
using qr.solve(A, b). (Note: this can be useful in a linear regression context where A would be replaced by
the design matrix, b by the response and x by the vector of coefficients).
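
As a small sketch, the system H3 x = b from the Cholesky example can also be solved via the QR decomposition; the result should agree with the solution (27, -192, 210) obtained earlier with solve() and with forwardsolve()/backsolve().

> ## Sketch: solving H3 x = b via the QR decomposition
> b <- 1:3
> qr.solve(H3, b)
> qr.coef(H3qr, b)   # reuses the stored decomposition from above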

Computational approaches to hypothesis testing

Hypothesis testing problems


Hypothesis testing is usually of the form
H0 vs. H1
where certain assumptions are made on the distribution of the data especially under H0 .
For example:

• One-sample location test: H0 : µ = µ0 vs. H1 : µ ̸= µ0 under the assumption f (x−µ) = f (−(x−µ)) (symmetric
density).
• Two-sample location test: H0 : F (X) = F (Y ) vs. H1 : F (X) ̸= F (Y ) where the difference is at most in the
locations between the two groups X and Y .
• Test of independence: H0 : FX,Y = FX FY vs. H1 : FX,Y ̸= FX FY .

Cook book
The goal of classical hypothesis testing is to answer the question: Given a sample and an apparent effect, what is the
probability of seeing such an effect by chance?

1. quantify the size of the apparent effect by choosing a test statistic.


2. define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is
not real.
3. compute a p-value, which is the probability of seeing the apparent effect if the null hypothesis is true.
4. interpret the result: if the p-value is low (by convention, 5% is a typical threshold of statistical significance),
the effect is said to be statistically significant, which means that it is unlikely to have occurred by chance. In
that case we infer that the effect is more likely to appear in the larger population.

Functions in R: tests for numeric data


Here are some tests for numeric data:
t.test one and two sample t test
cor.test correlation test
var.test F test to compare two variances
bartlett.test Bartlett’s test for k variances
wilcox.test one and two sample Wilcoxon test
kruskal.test Kruskal-Wallis test
friedman.test Friedman’s test
ks.test one and two sample Kolmogorov-Smirnov test

Functions in R: Tests for nonnumeric data


Here are some tests for nonnumeric data:
binom.test binomial test
prop.test test to compare proportions
prop.trend.test Chi-square test for trend in proportions
fisher.test Fisher’s exact test for two-dimensional
contingency tables
chisq.test Chi-square test for contingency tables

Example: t-test in R I
• Student’s t-test is used to test in normal populations a hypothesis about the location or to compare the location
of two normal populations.
• In the latter case one must furthermore decide if the two populations have the same variance or not and if the
test is based on paired or independent observations.

Example: t-test in R II
• All these cases are considered in the function t.test.

– For the one sample case only a numeric vector has to be submitted and by default the hypothetical
location is the origin.
– For the two sample case one can submit either two numeric vectors or a formula where the independent
variable is a factor with two levels.
– In the two sample case the default setting assumes that the samples are independent and have different
variances.

• Useful functions in this context are also

– power.t.test: used to compute the power of the one- or two- sample t test, or determine parameters to
obtain a target power.
– pairwise.t.test: calculates pairwise comparisons between group levels with corrections for multiple
testing
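
A short sketch of power.t.test() with made-up values: either the power for a given sample size or the sample size needed for a target power can be computed (the quantity left unspecified is the one that is solved for).

> ## Sketch: power of a two-sample t-test and required sample size
> power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)
> power.t.test(power = 0.8, delta = 0.5, sd = 1, sig.level = 0.05)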

One sample t-test example I


We use the crabs data again to test H0 : µ = 10 vs. H1 : µ ̸= 10

> data("crabs", package = "MASS")


> one.samp.t <- t.test(crabs$RW, mu = 10)
> one.samp.t

One Sample t-test

data: crabs$RW
t = 15, df = 199, p-value <2e-16
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
12.38 13.10
sample estimates:
mean of x
12.74

One sample t-test example II


Lets check the structure of the output

> str(one.samp.t)
List of 10
$ statistic : Named num 15
..- attr(*, "names")= chr "t"
$ parameter : Named num 199
..- attr(*, "names")= chr "df"
$ p.value : num 1.12e-34
$ conf.int : num [1:2] 12.4 13.1

..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 12.7
..- attr(*, "names")= chr "mean of x"
$ null.value : Named num 10
..- attr(*, "names")= chr "mean"
$ stderr : num 0.182
$ alternative: chr "two.sided"
$ method : chr "One Sample t-test"
$ data.name : chr "crabs$RW"
- attr(*, "class")= chr "htest"

One sample t-test example III


The object returned by t.test (as by the other *.test functions) is a list, so we can extract its elements using:

> one.samp.t$statistic
t
15.05
> one.samp.t$p.value
[1] 1.116e-34

Two sample t-test example I


We will use the formula method to get the two sample test to check if RW differs between the sexes.
H0 : µF = µM vs. H1 : µF ̸= µM

> t.test(RW ~ sex, data = crabs)

Welch Two Sample t-test

data: RW by sex
t = 4.3, df = 188, p-value = 3e-05
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
0.8086 2.1854
sample estimates:
mean in group F mean in group M
13.49 11.99

Two sample t-test example II


The same result could be also obtained using

> RW.males <- crabs$RW[crabs$sex == "M"]


> RW.females <- crabs$RW[crabs$sex == "F"]
> t.test(RW.males,RW.females)

Welch Two Sample t-test

data: RW.males and RW.females


t = -4.3, df = 188, p-value = 3e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.1854 -0.8086
sample estimates:
mean of x mean of y
11.99 13.49

Note: Welch test is performed by default (assumes unequal variances).

Types of errors
• In classical hypothesis testing, an effect is considered statistically significant if the p-value is below some threshold,
commonly 5% (known as significance level).
• Assume you have a testing problem with null hypothesis H0 vs. alternative H1 which is done at the significance level α.
• Two errors can basically occur during testing:

– Type 1 error: the effect is actually due to chance, but we will wrongly consider it significant (H0 is rejected but
true).
– Type 2 error: the effect is real but the test fails (H0 is not rejected but false).

• Probability of Type 1 error is the false positive rate.

The false positive rate


• Relatively easy to compute: if the threshold is α, the false positive rate is α.

– If there is no real effect, the null hypothesis is true, so we can compute the distribution of the test statistic by
simulating the null hypothesis. Call this distribution CDFT .
– Each time we run an experiment, we get a test statistic t which is drawn from CDFT . Then we compute a p-value,
which is the probability that a random value from CDFT exceeds t, so that’s 1 − CDFT (t).
– The p-value is less than 5% if CDFT (t) is greater than 95%; that is, if t exceeds the 95th percentile. And how
often does a value chosen from CDFT (t) exceed the 95th percentile? 5% of the time.
– → If the null is true, the p-value has a uniform distribution over the interval [0,1] (assuming a continuous test statistic)

Simulation in R: one sample t-test behavior under null I


Assume we have normally distributed data from N (µ, 1) and we have the null H0 : µ = 0.
We draw samples of size 50 from the null N (0, 1) repeatedly m times, perform the one sample t-test and we store 5000 p-values.

> set.seed(1)
> n <- 50
> m <- 5000
> Pvalue <- replicate(m, t.test(rnorm(n))$p.value)
> summary(Pvalue)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.243 0.496 0.496 0.745 1.000

Simulation in R: one sample t-test behavior under null II

> hist(Pvalue, freq = FALSE)

[Figure: histogram of the simulated p-values (density scale), approximately uniform on [0, 1].]

Power of a test
• The power of a test is defined as 1 - false negative rate.
• The false negative rate is harder to compute because it depends on the actual effect size, and normally we don’t know
that.
• One option is to compute a rate conditioned on a hypothetical effect size.

Simulation in R: power of the one sample t-test


Consider the example from before. We assume several scenarios: the real mean is ∆1 = 0, ∆2 = 0.1, ∆3 = 0.2, ∆4 = 0.3.
We draw samples of size n = 30 from N (∆j , 1) repeatedly m = 1000 times.
Assume α = 5%.

> n <- 30; m <- 1000; alpha <- 0.05


> Delta1 <- 0; Delta2 <- 0.1; Delta3 <- 0.2; Delta4 <- 0.3
> set.seed(1)
> PvalueD1 <- replicate(m, t.test(rnorm(n) + Delta1)$p.value)
> set.seed(1)
> PvalueD2 <- replicate(m, t.test(rnorm(n) + Delta2)$p.value)
> set.seed(1)
> PvalueD3 <- replicate(m, t.test(rnorm(n) + Delta3)$p.value)
> set.seed(1)
> PvalueD4 <- replicate(m, t.test(rnorm(n) + Delta4)$p.value)
> POWER <- colMeans(
+ cbind(PvalueD1, PvalueD2, PvalueD3, PvalueD4) < alpha)
> names(POWER) <- c("D0","D1","D2","D3")
> POWER
D0 D1 D2 D3
0.042 0.072 0.158 0.350

Changes under the H0
Given the testing problem and the assumptions made, test statistics can be computed and are compared to some theoretically derived critical value, often based on further assumptions.
The null hypothesis, however, often implies that, if it were true, a “relabeling” of the data would be possible without changing “anything”.

• One-sample location test: We can change the sign of x (after centering wrt to µ0 ) and nothing should change.
• Two-sample location test: We can switch observations between the two groups without changing anything.
• Test of independence: We can match the “X” part with the “Y” part from different observations.

Sign-change, permutation and randomization tests


If one goes through all possible label changes one can compare the test statistics computed from all relabeled samples to the
one from the original sample and under H0 the test statistic should not “stick” out. If it does however look extreme then the
null hypothesis might not be true. Therefore we can base the test decision on this.
Such tests are usually called sign-change and permutation tests.
The number of possible permutations can of course be so high that going through all combinations is not feasible even with modern computers. In such cases usually just m samples are generated where the labels are permuted randomly and the decision is based on this random subset. In that case the tests are often called randomization tests.

Randomization test procedure


Assume you have a testing problem and know how to relabel the observations. Then a randomization test has the following
steps:

1. Compute the test statistic T for the observed data.


2. For m replications compute the test statistic T i , i = 1, . . . , m where each of them is based on a randomly relabeled set
of the observations.
3. The empirical p-value is the relative frequency of the T i being “more extreme than” T :

   p = #{i : T i more extreme than T } / m

4. Reject H0 at level α if p < α.

Note that often p = (1 + #{i : T i more extreme than T }) / (1 + m) is used to avoid p = 0.

One sample location testing problem


Assume we make the following assumptions:

1. our random sample is iid.


2. the underlying density is symmetric around the location.

And we have the testing problem:


H0 : The symmetry center is = µ0 vs. H1 : The symmetry center is ̸= µ0

Assuming then either that the data is normal or that the sample size is large with finite second moments we can use the one
sample t-test.

One sample t-test


Remember, the test statistic of the one sample t-test for a sample of size n is:
t = (x̄ − µ0 ) / (s/√n)
and “is” t-distributed with n − 1 degrees of freedom, where x̄ is the sample mean and s the sample standard deviation.
Hence for a two-sided alternative we reject H0 at level α if |t| > tn−1,1−α/2 and the p-value is:
p = 2 ∗ P (X ≥ |t|),
where X has a t-distribution with n − 1 degrees of freedom.

Randomized one sample t-test
If however the sample size is small and normality cannot be assumed, it is better to use the randomized one sample t-test.
So in this set-up, because of the symmetry under the null, the signs of x − µ0 can be “relabeled”. Therefore the randomization (sign-change) version of this test has the following steps:

1. compute y = x − µ0 .
2. randomly change the signs of y.
3. compute ti = ȳ/(sy /√n).
4. compute how often ti is more extreme than t, where one has to remember that we test two-sided!

R function for t-test


Of course R has the in-built function t.test for the one sample t-test. But just to demonstrate how simple it is to write such a function:

> my.ttest <- function(x, mu = 0){


+ n <- length(x)
+ X <- x-mu
+ SD <- sd(X)
+ test.statistic <- mean(X)/(SD/sqrt(n))
+ p.value <- 2*(1 - pt(abs(test.statistic), df = n-1))
+ res <- list(statistic = test.statistic,
+ p.value = p.value, df = n-1)
+ return(res)
+ }

R function for randomized t-test


Now the function for the randomized test:

> test.statistic.t0 <- function(X) mean(X)/(sd(X)/sqrt(length(X)))


> my.ttest.perm <- function(x,mu = 0, m = 1000) {
+ n <- length(x)
+ X <- x-mu
+ SD <- sd(X)
+ test.statistic <- mean(X)/(SD/sqrt(n))
+ PT <- replicate(m, test.statistic.t0(
+ sample(c(1,-1), n,
+ replace=TRUE)*X))
+ p.value <- mean(abs(test.statistic)<abs(PT))
+ res <- list(statistic = test.statistic,
+ p.value = p.value, replications = m)
+ return(res)
+ }

One sample t-test comparison


Let us specify H0 : µ = 2 and compare the tests in different scenarios:

1. The data is N(2,1).


2. The data is N(2.5,1).

And the sample size is always 30.

> set.seed(4321)
> n <- 30
> x1 <- rnorm(n,2,1)
> x2 <- x1 + 0.5

One sample t-test comparison II

> my.ttest(x1, mu=2)$p.value


[1] 0.7894
> t.test(x1, mu=2)$p.value
[1] 0.7894
> my.ttest.perm(x1,mu=2)$p.value
[1] 0.793
> my.ttest(x2, mu=2)$p.value
[1] 0.006206
> t.test(x2, mu=2)$p.value
[1] 0.006206
> my.ttest.perm(x2,mu=2)$p.value
[1] 0.004

One sample t-test comparison III


Now the same hypothesis but with a different distribution

1. The data is t3 + 2.
2. The data is t3 + 2.5.

And the sample size is always 30.

> set.seed(4321)
> n <- 30
> y1 <- rt(n, df=3) + 2
> y2 <- y1 + 0.5

One sample t-test comparison IV

> my.ttest(y1, mu=2)$p.value


[1] 0.3326
> t.test(y1, mu=2)$p.value
[1] 0.3326
> my.ttest.perm(y1,mu=2)$p.value
[1] 0.336
> my.ttest(y2, mu=2)$p.value
[1] 0.002954
> t.test(y2, mu=2)$p.value
[1] 0.002954
> my.ttest.perm(y2,mu=2)$p.value
[1] 0.002

Two sample location problem


In the following we will consider the two sample location problem for independent samples.
As setup consider having the two samples
x1 , . . . , xm and y1 , . . . , yn
with cdfs FX and FY .
For simplicity we assume that all sampled values are distinct.
We assume that V ar(X) = V ar(Y ) and are interested in the testing problem
H0 : FX = FY vs. H1 : FX ̸= FY .

Tests for this problem
There are many tests for this problem and we assume the two sample t-test is well known.
In the following we will consider the (non-parametric) alternatives:

• sign or median test


• Wilcoxon rank sum test

Signs and ranks


For the nonparametric tests the principles of signs and ranks are of importance.
Let z1 , . . . , zn be a sample of n unique data points. Then denote

S(zi ) = 0 for zi < 0 and S(zi ) = 1 for zi ≥ 0

as the sign of zi and

R(zi ) = Σ_{j=1}^{n} S(zi − zj )

as the rank of zi .

Two sample sign test


Let M be the median of the combined sample x1 , . . . , xm , y1 , . . . , yn .
Then denote
• K = Σ_{i=1}^{m} S(xi − M ) (number of observations in the first sample which are greater or equal than M )
• L = Σ_{j=1}^{n} S(yj − M ) (number of observations in the second sample which are greater or equal than M )

Then under H0 , K should be close to m/2 and L to n/2.

Using the notation N = n + m and s = K + L, under H0 K has a hypergeometric distribution which can be approximated using a standard normal distribution via

Z = (K − ms/N ) / sqrt( mns(N − s)/N^3 ).

Wilcoxon ranksum test


One of the most famous nonparametric tests is the Wilcoxon rank sum test.
Denote here R(xi ), i = 1, . . . , m as the ranks of the observations from the first group among the observations from the combined
sample x1 , . . . , xm , y1 , . . . , yn .
The test statistic is then

WN = Σ_{i=1}^{m} R(xi ).

Under H0 we have

E(WN ) = m(N + 1)/2 and Var(WN ) = mn(N + 1)/12.

Wilcoxon ranksum test II


The exact distribution of WN is available, but if one of the sample sizes exceeds 25 then usually the standard normal approximation via

Z = (WN − m(N + 1)/2) / sqrt( mn(N + 1)/12 )

is considered quite good.


The Wilcoxon rank sum test is equivalent to the Mann-Whitney U test (a small check follows below).
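
A small check (a sketch with simulated data): WN can be computed directly from the ranks of the combined sample; wilcox.test() reports the Mann-Whitney statistic, which in the absence of ties equals WN − m(m + 1)/2.

> ## Sketch: Wilcoxon rank sum statistic from the ranks of the combined sample
> set.seed(1)
> xs <- rnorm(10); ys <- rnorm(12)
> m1 <- length(xs)
> W.N <- sum(rank(c(xs, ys))[seq_len(m1)])
> W.N - m1 * (m1 + 1) / 2        # Mann-Whitney form
> wilcox.test(xs, ys)$statistic  # should agree (no ties here)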

Two sample sign test in R

> ST2S <- function(x, y) {


+ m <- length(x)
+ n <- length(y)
+ z <- c(x, y)
+ N <- m + n
+ M <- median(z)
+ K <- sum((x - M) >= 0)
+ L <- sum((y - M) >= 0)
+ s <- K + L
+ EK <- m * s / N
+ SDK <- sqrt(m) * sqrt(n) * sqrt(s) * sqrt(N - s) / Nˆ(3/2)
+ Z <- (K - EK)/SDK
+ p.val <- 2 * (1 - pnorm(abs(Z)))
+ list(K = K, Z = Z, p.val = p.val)
+ }

Example I: Data set


[Figure: boxplots of the two samples x and y.]

Example I: Two Sample t-test

> t.test(x, y)

Welch Two Sample t-test

data: x and y
t = -35, df = 957, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.096 -1.871
sample estimates:
mean of x mean of y
-0.02839 1.95483

Example I: Wilcoxon rank sum test

> wilcox.test(x, y)

Wilcoxon rank sum test with continuity correction

data: x and y
W = 44624, p-value <2e-16
alternative hypothesis: true location shift is not equal to 0

Example I: Two sample sign test

> ST2S(x, y)
$K
[1] 294

$Z
[1] -22.57

$p.val
[1] 0

Example II: Data set

[Figure: boxplots of x2 and y2 with their group means indicated.]

Example II: Two Sample t-test

> t.test(x2, y2)

Welch Two Sample t-test

data: x2 and y2
t = 8.4, df = 1351, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2132 0.3432
sample estimates:
mean of x mean of y
0.9859 0.7077

Example II: Wilcoxon rank sum test

> wilcox.test(x2, y2)

Wilcoxon rank sum test with continuity correction

data: x2 and y2
W = 261174, p-value = 0.2
alternative hypothesis: true location shift is not equal to 0

Example II: Two sample sign test

> ST2S(x2, y2)


$K
[1] 516

$Z
[1] 1.753

$p.val
[1] 0.07965

Which test to choose?


So which test should we choose? Better safe than sorry and always use non-parametric tests (as they relax the assumptions of parametric tests)?
No free lunch . . .
Approach: compare tests based on asymptotic relative efficiency.

Asymptotic relative efficiency


For a given α and a given power γ ∈ (α, 1) the asymptotic relative efficiency of test 1 with respect to test 2 is

lim_{ν→∞} nν,1 / nν,2 ,

where nν,1 is the minimal number of observations such that πnν,1 (0) ≤ α and πnν,1 (δν ) ≥ γ, and πn (δ) is the power function for effect δ.
Intuitively: an ARE of 2 means that twice as many observations are needed by test 1 than by test 2 to achieve the same power.

Efficiency comparisons
Assuming the data follow normal distributions which differ only in location, the asymptotic relative efficiencies (ARE) for the test comparisons are:

1. t vs K: 0.64
2. t vs WN : 0.95

However when the data has heavy tails then WN and K are more efficient than t.

Simulation study for efficiency comparisons


We assume we have normally distributed data with fixed standard deviation σ. In this example we are testing whether the difference in location (let's call it θ) between the groups is zero or not (H0 : θ = θ0 = 0 vs H1 : θ ̸= 0).

• M is the number of repetitions


• n is the number of observation in group 1
• m is the number of observations in group 2
• δ is a hypothesized effect size and it is assumed that under the alternative the effect converges to θ0 = 0 as n increases (as the sample size tends to infinity, the alternative is assumed to approach the null in order to keep the powers of the tests bounded away from one):

  θ1 = θ0 + δ/√N

Simulation study for efficiency comparisons in R

> SIMU <- function(M, n, m, delta, sigma = 1, seed = 1) {


+ TT <- numeric(M)
+ ST <- numeric(M)
+ WT <- numeric(M)
+ N <- n + m
+ Delta <- delta/sqrt(N)
+ set.seed(seed)
+ for (i in 1:M) {
+ x <- rnorm(m, sd=sigma)
+ y <- rnorm(n, mean = Delta, sd=sigma)
+ TT[i] <- t.test(x, y, var.equal=TRUE)$p.value
+ ST[i] <- ST2S(x, y)$p.val
+ WT[i] <- wilcox.test(x, y, exact=FALSE)$p.value
+ }
+ RES <- data.frame(n=n, m=m, delta=delta, sigma=sigma,
+ TT=TT, ST=ST, WT=WT)
+ RES
+ }

Simulation study for efficiency comparisons II

> RES0 <- SIMU(1000, 30, 30, delta=0)


> RES1 <- SIMU(1000, 30, 30, delta=2.86)
> RES2 <- SIMU(1000, 30, 30, delta=4.50)
> RES3 <- SIMU(1000, 30, 30, delta=6.60)
> P0 <- colMeans(RES0[,5:7] < 0.05)
> P1 <- colMeans(RES1[,5:7] < 0.05)
> P2 <- colMeans(RES2[,5:7] < 0.05)
> P3 <- colMeans(RES3[,5:7] < 0.05)

Simulation study for efficiency comparisons II

> P0
TT ST WT
0.050 0.067 0.051
> P1
TT ST WT
0.291 0.243 0.270
> power.t.test(30, 2.86/sqrt(60))

Two-sample t test power calculation

n = 30
delta = 0.3692
sd = 1
sig.level = 0.05
power = 0.2899
alternative = two.sided

NOTE: n is number in *each* group

Simulation study for efficiency comparisons III

> P2
TT ST WT
0.581 0.478 0.550
> power.t.test(30, 4.50/sqrt(60))

Two-sample t test power calculation

n = 30
delta = 0.5809
sd = 1
sig.level = 0.05
power = 0.5997
alternative = two.sided

NOTE: n is number in *each* group

Simulation study for efficiency comparisons IV

> P3
TT ST WT
0.899 0.782 0.879
> power.t.test(30, 6.60/sqrt(60))

Two-sample t test power calculation

n = 30
delta = 0.8521
sd = 1
sig.level = 0.05
power = 0.9006
alternative = two.sided

NOTE: n is number in *each* group

Simulation study for efficiency comparisons V

> P1[2]/P1[1]
ST
0.8351
> P2[2]/P2[1]
ST
0.8227
> P3[2]/P3[1]
ST
0.8699
> P1[3]/P1[1]
WT
0.9278
> P2[3]/P2[1]
WT
0.9466
> P3[3]/P3[1]
WT
0.9778

Simulation study for efficiency comparisons VI

> RES0b <- SIMU(1000, 100, 100, delta=0)


> RES1b <- SIMU(1000, 100, 100, delta=2.86)
> RES2b <- SIMU(1000, 100, 100, delta=4.50)
> RES3b <- SIMU(1000, 100, 100, delta=6.60)
> P0b <- colMeans(RES0b[,5:7] <0.05)
> P1b <- colMeans(RES1b[,5:7] <0.05)
> P2b <- colMeans(RES2b[,5:7] <0.05)
> P3b <- colMeans(RES3b[,5:7] <0.05)

Simulation study for efficiency comparisons VII

> P0b
TT ST WT
0.043 0.062 0.042
> P1b
TT ST WT
0.317 0.245 0.291
> P2b
TT ST WT
0.630 0.489 0.615
> P3b
TT ST WT
0.917 0.808 0.904

Simulation study for efficiency comparisons VIII


Relative empirical power of the sign test (ST) and the Wilcoxon test (WT) compared to the t-test:

test  delta  n=30    n=100
ST    2.86   0.8351  0.7729
ST    4.50   0.8227  0.7762
ST    6.60   0.8699  0.8811
WT    2.86   0.9278  0.9180
WT    4.50   0.9466  0.9762
WT    6.60   0.9778  0.9858

Numerical optimization and root finding in R
Numerical optimization
• In many areas of statistics and mathematics we have to solve problems like: given a function f () which value of x makes
f (x) as small or as large as possible?
• E.g., in statistical modeling we may want to find a set of parameters for a model which minimizes the expected prediction errors.
• In some cases we might also have some constraints on x, e.g., the parameters shall be non-negative.
• Use of derivatives and linear algebra often lead to solutions for these problems, but not nearly always. This is where
numerical optimization comes in.

Root finding
• Root finding and unconstrained optimization are closely related: solving f (x) = 0 can be accomplished via minimizing
||f (x)||2 and unconstrained optima of f must be critical points i.e., solve ▽f (x) = 0.
• For linear least squares problems this can be solved “exactly” using techniques for linear algebra.
• Other problems can typically only be solved as limits of iterations xk = g(xk−1 ).

Newton’s method for root finding


• Newton’s method for root finding is a popular numerical method to find the root of the algebraic equation(s):
f (x) = 0

• If f is smooth and Jf (x) = [∂fi /∂xj (x)], the idea is based on the Taylor approximation f (xk ) ≈ f (xk−1 ) + Jf (xk−1 )(xk − xk−1 ).
• If started close enough to the root, the following iteration will converge to a root of the above equation:
xk = xk−1 − Jf−1 (xk−1 )f (xk−1 ) = g(xk−1 ), x0 = initial guess

• Note: The computational form is Jf (xk−1 )sk−1 = −f (xk−1 ), xk = xk−1 + sk−1 .

One dimensional example I


Goal: find root of f (x) = x3 + 15x − 4. If x0 is close enough to one of the three roots, then Newton’s method should converge
to a root.

> algNewton <- function(x0, f, f.prime) {


+ x <- x0
+ f.val <- f(x)
+ tol <- 0.00000001
+ while (abs(f.val) > tol) {
+ f.prime.val <- f.prime(x)
+ x <- x - f.val/f.prime.val
+ f.val <- f(x)
+ }
+ return(x)
+ }
>
> xstar <- algNewton(1.5,
+ f = function(x) xˆ3 + 15 * x - 4,
+ f.prime = function(x) 3 * xˆ2 + 15)
> xstar
[1] 0.2654

Function curve()
The function curve draws a curve corresponding to a function over the interval [from, to].

> f <- function(x) xˆ3 + 15 * x - 4
> curve(f, -5, 5)
> abline(h = 0)
> abline(v = xstar, lty = 2)
[Figure: curve of f over [−5, 5] with a horizontal line at 0 and a dashed vertical line at the computed root.]

The function has one real root, to which Newton's method converged above (x ≈ 0.2654).

One dimensional example II


Goal: find root of f (x) = x3 − 15x − 4. If x0 is close enough to one of the three roots, then Newton’s method should converge
to a root.

> f <- function(x) xˆ3 - 15 * x - 4


> curve(f, -5, 5)
> abline(h = 0)
[Figure: curve of f over [−5, 5] with a horizontal line at 0; the three real roots are where the curve crosses zero.]
The function has 3 real roots. The analytical solutions are −2 − √3, −2 + √3 and 4.

One dimensional example III

> algNewton(x0 = 0, f = f, f.prime = function(x) 3 * xˆ2 - 15)


[1] -0.2679
> algNewton(x0 = 3, f = f, f.prime = function(x) 3 * xˆ2 - 15)
[1] 4
> algNewton(x0 = -3, f = f, f.prime = function(x) 3 * xˆ2 - 15)
[1] -3.732

Quasi-Newton methods for root finding


• Newton’s method is rather costly, especially if x has large dimension:

– In each step O(n^2) derivatives need to be computed (exactly or numerically)

– In each step an n × n linear system needs to be solved.

• Quasi-Newton methods which replace Jf (xk−1 ) with another Bk which is less costly to compute or to invert can be
employed for root finding. The most famous such method is Broyden’s method.

– It considers approximations Bk which exactly satisfy the secant equation f (xk ) = f (xk−1 ) + Bk (xk − xk−1 ).
– The problem ends up being a convex quadratic optimization problem with linear constraints.

• For very large n spectral methods such as Barzilai-Borwein can be used.

Tools in R
• These methods are only available in R extension packages.
• Package nleqslv has function nleqslv() which provides the Newton and Broyden methods.
• Package BB has function BBsolve() for Barzilai-Borwein solvers.
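
A sketch of BB::BBsolve() (this assumes the BB package is installed); it takes a starting value and the function whose root is sought. The system used here is the same one solved with nleqslv() in the example below.

> ## Sketch (requires package BB): Barzilai-Borwein root finding
> fn0 <- function(x) c(x[1]^2 + x[2]^2 - 2, exp(x[1] - 1) + x[2]^3 - 2)
> BB::BBsolve(par = c(2, 0.5), fn = fn0)$par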

Two-dimensional example using nleqslv() I


Consider the nonlinear system, which has solution x1 = x2 = 1:

x1^2 + x2^2 = 2,   exp(x1 − 1) + x2^3 = 2

> fn <- function(x) {
+   c(x[1]^2 + x[2]^2 - 2, exp(x[1] - 1) + x[2]^3 - 2)
+ }

Two-dimensional example using nleqslv() II

> x0 <- c(2, 0.5) # Initial guess


> nleqslv::nleqslv(x0, fn, method = "Broyden")
$x
[1] 1 1

$fvec
[1] 1.500e-09 2.056e-09

$termcd
[1] 1

$message
[1] "Function criterion near zero"

$scalex
[1] 1 1

$nfcnt
[1] 12

$njcnt
[1] 1

$iter
[1] 10

Two-dimensional example using nleqslv() III

> x0 <- c(2, 0.5) # Initial guess


> nleqslv::nleqslv(x0, fn, method = "Newton")
$x
[1] 1 1

$fvec
[1] 6.842e-10 1.764e-09

$termcd
[1] 1

$message
[1] "Function criterion near zero"

$scalex
[1] 1 1

$nfcnt
[1] 6

$njcnt
[1] 5

$iter
[1] 5

Both methods deliver the correct solution; Newton's method needs fewer iterations and fewer function evaluations, but more Jacobian evaluations.

Two-dimensional example using nleqslv() IV


We can also provide the Jacobian matrix $J_f(x)$:

> jac <- function(x) {
+   # Jacobian of fn; note that matrix() fills column by column
+   matrix(c(2 * x[1], exp(x[1] - 1),
+            2 * x[2], 3 * x[2]^2), nrow = 2L)
+ }
>
> nleqslv::nleqslv(x0, fn, jac = jac, method = "Newton")
$x
[1] 1 1

$fvec
[1] 6.839e-10 1.762e-09

$termcd
[1] 1

$message
[1] "Function criterion near zero"

$scalex
[1] 1 1

$nfcnt
[1] 6

$njcnt
[1] 5

$iter
[1] 5

Optimization
In this section we will cover some algorithms and types of optimization problems and show how they can be implemented in R.

• Newton-Raphson
• Linear programming
• Quadratic programming

More on optimization in R can be found in the CRAN Task View on Optimization: https://cran.r-project.org/web/views/Optimization.html

Newton-Raphson I
• If the function to be minimized has two continuous derivatives and we know how to evaluate them, we can employ the
Newton-Raphson algorithm.
• If we have a guess $x_0$ at a minimizer, we use a local quadratic approximation for $f$ (equivalently, a linear approximation for $\nabla f$):
$$x_k = x_{k-1} - H_f^{-1}(x_{k-1}) \nabla f(x_{k-1}),$$
where $H_f(x) = [\partial^2 f/\partial x_i \partial x_j(x)]$ is the Hessian matrix of $f$ at $x$.

Newton-Raphson II
• It can be shown that the NR algorithm converges to a local minimum if $x_0$ is close enough to the solution.
• In practice it can be quite tricky:
– If the second derivative at $x_{k-1}$ is 0, then the Taylor series approximation has no minimizer and the update is not defined.
– If $x_{k-1}$ is too far from the solution, the Taylor approximation can be so inaccurate that $f(x_k)$ is larger than $f(x_{k-1})$. In this case one can replace $x_k$ by $(x_k + x_{k-1})/2$.

Newton-Raphson example in R
Minimize $f(x) = e^{-x} + x^4$ by applying algNewton() to its derivative $f'$:

> fprime <- function(x) -exp(-x) + 4 * x^3
> fprimeprime <- function(x) exp(-x) + 12 * x^2
> algNewton(1, f = fprime, f.prime = fprimeprime)
[1] 0.5283
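
As a cross-check (a sketch; the printed digits may differ slightly), the same minimizer can be obtained with the general-purpose one-dimensional optimizer optimize() introduced below:

> f <- function(x) exp(-x) + x^4
> optimize(f, interval = c(-2, 2))$minimum  # should be close to 0.5283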

Built-in functions
• In R there are several general purpose optimizers.
• For one-dimensional optimization optimize() can be used.

• Multidimensional optimizers:
– optim(), which implements Nelder-Mead's simplex method, quasi-Newton variants of Newton-Raphson (such as BFGS) and others.
– nlminb()
– nlm()
• If linear inequality constraints on the parameters are needed, constrOptim() can be used.

Example: optimize()
f (x) = |x − 3.5| + |x − 2| + |x − 1|

> f <- function(x) abs(x - 3.5) + abs(x - 2) + abs(x - 1)


> optimize(f, interval = c(0, 10))
$minimum
[1] 2

$objective
[1] 2.5

Example: optim()
f (a, b) = (a − 1) + 3.2/b + 3 log(Γ(a)) + 3a log(b)

> f <- function(x) (x[1] - 1) + 3.2/x[2] +
+   3 * log(gamma(x[1])) + 3 * x[1] * log(x[2])
> optim(c(1, 1), f)
$par
[1] 1.400 0.762

$value
[1] 3.099

$counts
function gradient
47 NA

$convergence
[1] 0

$message
NULL
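
By default optim() uses the Nelder-Mead simplex method; other methods can be selected via the method argument, for example a quasi-Newton method (a sketch; output omitted):

> optim(c(1, 1), f, method = "BFGS")  # gradient-based; the gradient is approximated numerically here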

Linear programming
When the function to optimize is linear and when the constraints we impose on the values x are linear, the problem is called
linear programming.

$$\min_{x_1,\ldots,x_k} \; c_1 x_1 + \ldots + c_k x_k,$$
subject to:
$$\begin{aligned} a_{11} x_1 + \ldots + a_{1k} x_k &\geq b_1 \\ &\;\;\vdots \\ a_{m1} x_1 + \ldots + a_{mk} x_k &\geq b_m \end{aligned}$$
and $x_1 \geq 0, \ldots, x_k \geq 0$.

Linear programming in R
• The function lp() from the lpSolve package can be used to solve linear programming problems.
– argument objective.in - the vector of coefficients of the objective function.
– argument const.mat - a matrix containing the coefficients of the x variables in the left hand side of the constraints;
each row corresponds to a constraint.
– argument const.dir - a character vector containing the direction of the inequality constraints (>=, ==, <=).
– argument const.rhs - a vector containing the constants on the right-hand side of the constraints.
• It is based on the revised simplex method.

Linear programming pollution example I
• A company has developed two procedures for reducing sulfur dioxide and carbon dioxide emissions from its factory.
• The first procedure reduces equal amounts of each gas at a per unit cost of $5.
• The second procedure reduces the same amount of sulfur dioxide as the first method, but reduces twice as much carbon
dioxide gas; the per unit cost of this method is $8.
• The company is required to reduce sulfur dioxide emissions by 2 million units and carbon dioxide emissions by 3 million
units.
• What combination of the two emission procedures will meet this requirement at minimum cost?

Linear programming example II


• Let x1 denote the amount of the first procedure to be used, and let x2 denote the amount of the second procedure to be
used. For convenience, we will let these amounts be expressed in millions of units.
• Then the cost (in millions of dollars) can be expressed as
C = 5x1 + 8x2 .

• Since both methods reduce sulfur dioxide emissions at the same rate, the number of units of sulfur dioxide reduced will
then be x1 + x2 .
• Noting that there is a requirement to reduce the sulfur dioxide amount by 2 million units, we have the constraint
x1 + x2 ≥ 2.
• The carbon dioxide reduction requirement is 3 million units, and the second method reduces carbon dioxide twice as fast
as the first method, so we have the second constraint x1 + 2x2 ≥ 3.
• Finally, we note that x1 and x2 must be nonnegative, since we cannot use negative amounts of either procedure.

Linear programming example III

> A <- matrix(c(1, 1, 1, 2), nrow = 2L)


> b <- c(2, 3)
> obj <- c(5, 8)
> res <- lpSolve::lp(
+ objective.in = obj, const.mat = A,
+ const.dir = c(">=", ">="),
+ const.rhs = b)
> res$solution
[1] 1 1

Note: Setting direction = "max" will allow the specification of maximization problems.

Multiple optima
It sometimes happens that a linear programming problem has multiple solutions. The following problem has solutions at $(1, 1)$ and at $(3, 0)$, both with objective value 12:
$$\min_{x_1, x_2} \; 4x_1 + 8x_2,$$
subject to:
$$x_1 + x_2 \geq 2, \quad x_1 + 2x_2 \geq 3, \quad x_1 \geq 0, \; x_2 \geq 0.$$

> res <- lpSolve::lp(direction = "min",


+ objective.in = c(4, 8),
+ const.mat = matrix(c(1, 1, 1, 2), nrow = 2L),
+ const.dir = c(">=", ">="),
+ const.rhs = c(2, 3))
> res$solution
[1] 3 0

The lp() function does not alert the user to the existence of multiple minima.
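
A quick manual check (a sketch) confirms that both candidate solutions attain the same objective value:

> c(sum(c(4, 8) * c(1, 1)), sum(c(4, 8) * c(3, 0)))  # both equal 12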

Infeasibility
In this example it is clear that the constraints cannot be simultaneously satisfied:

$$\min_{x_1, x_2} \; 5x_1 + 8x_2,$$
subject to:
$$x_1 + x_2 \geq 2, \quad x_1 + 2x_2 \leq 1, \quad x_1 \geq 0, \; x_2 \geq 0.$$

> res <- lpSolve::lp(


+ objective.in = c(5, 8),
+ const.mat = matrix(c(1, 1, 1, 2), nrow = 2L),
+ const.dir = c(">=", "<="),
+ const.rhs = c(2, 1))
> res
Error: no feasible solution found

Unboundedness
In some cases the objective and the constraints give rise to an unbounded solution:
$$\max_{x_1, x_2} \; 5x_1 + 8x_2,$$
subject to:
$$x_1 + x_2 \geq 2, \quad x_1 + 2x_2 \geq 3, \quad x_1 \geq 0, \; x_2 \geq 0.$$

> res <- lpSolve::lp(direction = "max",


+ objective.in = c(5, 8),
+ const.mat = matrix(c(1, 1, 1, 2), nrow = 2L),
+ const.dir = c(">=", ">="),
+ const.rhs = c(2, 3))
> res
Error: status 3

Quadratic programming I
• Linear programming problems are a special case of the more general class of optimization problems in which a possibly nonlinear function is minimized subject to constraints.
• Such problems are typically more difficult to solve and are beyond the scope of this course; an exception is the case where
the objective function is quadratic and the constraints are linear.
• A quadratic programming problem with $m$ constraints is often of the form:
$$\min_x \; \frac{1}{2} x^\top D x - d^\top x$$
subject to constraints $A^\top x \geq b$. Here $x$ is a vector of $p$ unknowns, $D$ is a positive definite $p \times p$ matrix, $d$ is a vector of length $p$, $A$ is a $p \times m$ matrix, and $b$ is a vector of length $m$.

Quadratic programming II
In R the solve.QP() function of the quadprog package can be used to solve quadratic programs.

• Dmat - a matrix containing the elements of the matrix (D) of the quadratic form in the objective function
• dvec - a vector containing the coefficients of the decision variables x in the objective function
• Amat - a matrix containing the coefficients of the decision variables in the constraints ($A$ above); each column of the matrix corresponds to a constraint
• bvec - a vector containing the constants given on the right-hand side of the constraints
• meq - a number indicating the number of equality constraints. By default, this is 0. If it is not 0, the equality constraints should be listed ahead of the inequality constraints.

Quadratic programming example I
• Assume we want to find out how much money to invest in each of a set of $n$ stocks. Let $x$ denote the vector of portfolio weights, so that the portfolio variance is $\sigma_{p,n}^2 = x^\top \Sigma x$, where $\Sigma$ is the covariance matrix of the returns of the stocks, and let $\mu$ denote the vector of average returns of the individual stocks.
• The problem of trading off portfolio variance against expected return is
$$\min_x \; x^\top \Sigma x - \mu^\top x.$$
• We require $\sum_{i=1}^{n} x_i = 1$ (they are weights) and we do not allow short selling, i.e., $x_i \geq 0$.

Quadratic programming example II


• Assume $n = 3$. The problem can be rewritten as:
$$\min_x \; \frac{1}{2} x^\top D x - d^\top x,$$
where $D = 2\Sigma$ and $d = \mu$.
• The constraints can be specified as:
$$\underbrace{\begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{A^\top}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
\quad
\begin{matrix} = 1 \\ \geq 0 \\ \geq 0 \\ \geq 0 \end{matrix}$$

Quadratic programming example in R

> Sigma <- matrix(c(0.01, 0.002, 0.002,


+ 0.002, 0.01, 0.002,
+ 0.002, 0.002, 0.01), nrow = 3L)
> mu <- c(0.002, 0.005, 0.01)
> A <- matrix(c(1, 1, 0, 0,
+ 1, 0, 1, 0,
+ 1, 0, 0, 1), ncol = 4L, byrow = TRUE)
> b <- c(1, 0, 0, 0)
> quadprog::solve.QP(2 * Sigma, mu, A, b)
$solution
[1] 0.1042 0.2917 0.6042

$value
[1] -0.002021

$unconstrained.solution
[1] -0.02679 0.16071 0.47321

$iterations
[1] 2 0

$Lagrangian
[1] 0.003667 0.000000 0.000000 0.000000

$iact
[1] 1

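Note that the call above passes the sum-to-one condition as an inequality constraint (the default meq = 0 in solve.QP() treats all constraints as >=). Because this constraint is active at the solution (see $iact above), the result coincides with the intended problem; to enforce the equality explicitly one could pass meq = 1 (a sketch; output omitted):

> quadprog::solve.QP(2 * Sigma, mu, A, b, meq = 1)  # first constraint treated as an equality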