Computerstatistik Skriptum
Contents
Introduction 2
Data storage 41
Flow control 45
R functions 50
Basic statistics in R 55
Further R topics 81
1
Introduction
What is R
• R was developed by Ross Ihaka and Robert Gentleman (the “R & R’s” of the University of Auckland).
• Ihaka, R., Gentleman, R. (1996): R: A language for data analysis and graphics, Journal of Computa-
tional and Graphical Statistics, 5, 299-314.
• R is an environment and language for data manipulation, calculation and graphical display.
• R is a GNU program. This means it is an open source program (as e.g. Linux) and is distributed for
free.
• R is used by more than 2 million users worldwide (according to R Consortium).
• R was originally used by the academic community but it is currently also used by companies like
Google, Pfizer, Microsoft, Bank of America . . .
R communities
• R has local communities worldwide for users to share ideas and learn.
• R events are organized all over the world bringing its users together:
– Conferences (e.g. useR!, WhyR?, eRum)
– R meetups: check out meetup.com
2
R and related languages
• R can be seen as an implementation or dialect of the S language, which was developed at the AT & T
Bell Laboratories by Rick Becker, John Chambers and Allan Wilks.
• The commercial version of S is S-Plus.
• Most programs written in S run unaltered in R, however there are differences.
• Code written in C, C++ or FORTRAN can be run by R too. This is especially useful for
computationally-intensive tasks.
How to get R
• R is available for most operating systems, as e.g. for Unix, Windows, Mac and Linux.
• R can be downloaded from the R homepage http://www.r-project.org
• The R homepage contains besides the download links also information about the R Project and the R
Foundation, as well as a documentation section and links to projects related to R.
• R is available as 32-bit and 64-bit
• R comes normally with 14 base packages and 15 recommended packages
CRAN
• The R version used in the course is 4.1.2 (as of Winter semester 2021/2022).
R extension packages
• R can be easily extended with more packages, most of which can be downloaded from CRAN too.
Installation and updating of those packages is also possible from within R itself (18420 packages are
currently available on CRAN).
• Packages for the analysis and comprehension of genomic data can be downloaded from the Bioconductor
pages (http://www.bioconductor.org).
• But R packages are also available from many other sources like R-forge, Github, . . .
Other distributions of R
• As R is open source and published under a GNU license, one can also make one's own version of R and
distribute it.
• For example Microsoft has Microsoft R Open https://mran.microsoft.com/open
• But there are many others too. We use however here the standard R version from CRAN.
3
What R offers
R is therefore not only a plain statistics software package, but it can be used as one. Most of the standard
statistics and a lot of the latest methodology is available for R.
R screenshot
R console
• R by default has no graphical interface and the so called Console has to be used instead.
• The Console or Command Line Window is the window of R in which one writes the commands and in
which the (non-graphic) output will be shown.
• Commands can be entered after the prompt (>).
• Normally one types one command per row (Enter submits the command). If one wants to put several
commands in one row, the commands have to be separated by a ";".
• When a command line starts with a "+" instead of ">", it means that the last submitted command was
not completed and should be finished now.
• All submitted commands of a session can be recalled with the up and down arrows ↑↓.
4
R as a pocket calculator
> 7 + 11
[1] 18
> 57 - 12
[1] 45
> 12 / 3
[1] 4
> 5 * 4
[1] 20
> 2 ˆ 4
[1] 16
> sin(4)
[1] -0.7568025
• Using the R Console can be quite cumbersome, especially for larger projects. An alternative to the
Command Line Window is the usage of editors or IDEs (integrated development environments).
• Editors are stand-alone applications that can be connected to an installed R version and are used for
editing R source code. The commands are typed and submitted via the menu or key combinations. The
user usually has the choice to submit one command at a time or several commands at once.
• IDEs integrate various development tools (editors, compilers, debuggers, etc.) into a single program -
the user does not have to worry about connecting the individual components
• R has only a very basic editor included which can be started from the menu "File" -> "New script".
• Better editors are EMACS together with ESS, Tinn-R or WinEdt together with R-WinEdt.
These editors offer syntax highlighting and sometimes also templates for certain R structures.
• The most popular IDE is currently probably RStudio.
5
RStudio screenshot
• The main window in RStudio contains five parts: one Menu and four Windows (“Panes”)
• From the drop-down menu RStudio and R can be controlled.
• Pane 1 (top left) - Files and Data: Editing R-Code and view of data sets
• Pane 2 (top right) - Workspace and History:
A more sophisticated example than the previous one will demonstrate some features of R which will be
explained in detail later in the course.
6
> options(digits = 4)
> # setting random seed to get a reproducible example
> set.seed(1)
> # creating data
> eps <- rnorm(100, 0, 0.5)
> eps[1:5]
[1] -0.31323 0.09182 -0.41781 0.79764 0.16475
> group <- factor(rep(1:3, c(30, 40, 30)),
+ labels = c("group 1", "group 2", "group 3"))
> x <- runif(100, 20, 30)
> y <- 3 * x + 4 * as.numeric(group) + eps
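The data frame data.ex summarized below is not created in the code shown above; presumably it simply collects the simulated variables, e.g.:
> data.ex <- data.frame(y, x, group)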
> summary(data.ex)
y x group
Min. : 64.7 Min. :20.3 group 1:30
1st Qu.: 74.3 1st Qu.:21.9 group 2:40
Median : 79.5 Median :23.8 group 3:30
Mean : 81.1 Mean :24.4
3rd Qu.: 86.8 3rd Qu.:26.4
Max. :102.0 Max. :29.8
7
(Figure: scatterplot matrix of the variables y, x and group.)
Build a linear model:
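The fitting call itself is not shown in this extract; judging from the Call: lines below it was of the following form (the object name mod is an assumption), with the printed model first and its summary afterwards:
> mod <- lm(y ~ x + group)
> mod
> summary(mod)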
Call:
lm(formula = y ~ x + group)
Coefficients:
(Intercept) x groupgroup 2 groupgroup 3
3.77 3.01 4.07 7.98
Call:
lm(formula = y ~ x + group)
Residuals:
Min 1Q Median 3Q Max
-1.1988 -0.2797 0.0198 0.2792 1.0893
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.7682 0.4288 8.79 6e-14 ***
x 3.0110 0.0169 178.19 <2e-16 ***
groupgroup 2 4.0666 0.1094 37.18 <2e-16 ***
groupgroup 3 7.9754 0.1201 66.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
9
(Figure: regression diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance.)
At first sight R looks a bit difficult, but statistical analyses can already be done with a few basic commands.
To learn about those commands several sources are available:
• On the R homepage one can find the official manuals under Documentation -> Manuals. Especially
the “An Introduction to R” Manual is recommended.
• "Unofficial" tutorials and manuals, also in languages other than English, can be found on the R
homepage under Documentation -> Other or on CRAN under Documentation -> Contributed. Very
useful from there is the R reference card by Tom Short.
10
R Tutorials for SAS, Stata or SPSS users
Many new R users are familiar with SAS, Stata and/or SPSS. For them, special overview charts showing how
tasks they are used to doing in SAS, Stata or SPSS can be done in R, as well as extended manuals for an
easier move to R, are available.
The following references might then be helpful:
• http://r4stats.com
• Muenchen, R.A. (2008): R for SAS and SPSS Users
• Muenchen, R.A. and Hilbe, J. (2010): R for Stata Users
Help within R
• There are three types of help available in R. They can be accessed via the menu or the command
line. Here only the command line versions will be explained
• Using an internet browser:
> help.start() will invoke an internet browser with links to manuals, FAQs and the help pages of all
functions sorted by package, together with a search engine.
• The help command:
> help(command) will show the help page for command. A shorter version that does the same is > ?command. For a
few special commands the help works only when the command is quoted, e.g. > help("if")
• The help.search command:
With > help.search("keyword") one can search all titles and aliases of the help files for a keyword. A
shorter version that does the same is > ??keyword. This is however not a full text search.
There are also three other functions useful to learn about functions.
• apropos: apropos("string") searches all functions that have the string in their function name
• demo: The demo function runs some available scripts to demonstrate their usage. To see which topics
have a demo script submit > demo()
• example: > example(topic) runs all example codes from the help files that belong to the topic topic
or use the function topic.
• Also in case you only remember the beginning of a function name or are just lazy - R has an auto-completion
feature. If you start typing a command and hit Tab, R will complete the command if there are no
alternatives or will give you all the alternatives.
• R as one of the main statistical software programs has several mailing lists. There are general mailing
lists or lists of special interest groups like a list for mixed effects models or robust statistics (for details
see the R homepage).
• The general mailing list is R-help where questions are normally answered pretty quickly. But make
sure to read the posting guide before you ask something yourself! The R-help mails are also archived
and can be searched.
• The search link on the R homepage leads to more information on search resources.
• And last but not least, there is also Stack Overflow.
11
R Markdown
• Mixture of Markdown, a markup language for writing documents in plain text, and “chunks” of code
in R or another programming language.
• Then the input is rendered into a document (aka knitted), R runs the code, automatically collects
printed output and graphics and inserts them in the final document.
• In RStudio it can be created using File -> New File -> R Markdown. A window pops up where you
can choose among different types of output. Once this is chosen (e.g., a pdf document) a new file will
open with a template.
• The first part of the template is called YAML (Yet Another Markup Language) and contains informa-
tion that will be used when rendering your document.
• The actual document starts after the YAML preamble.
The five most used data structures in R can be categorized using their dimensionality and whether all content
must be of the same type, i.e. if they are homogeneous or heterogeneous.
Homogeneous Heterogeneous
1D vector list
2D matrix data frame
nD array
Scalars as on the previous slide are treated as vectors of length 1. And almost all other types of objects in
R are built upon these five structures.
To understand the structure of an object in R the best is to use
str(object)
12
Vectors in R
The most basic structure is a vector. They come in two different flavors:
• atomic vector
• list
• In an atomic vector all elements must be of the same type, whereas in the list the different elements
can be of different types.
• There are four common types for an atomic vector:
– logical
– integer
– double (often referred to as numeric)
– character
The most direct way to create a vector is the c function where all values can be entered. The values are
then concatenated.
A single number is also treated like a vector but can be assigned to an object even more easily:
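The creation of the vectors printed below is not included in this extract; definitions consistent with the output (the exact values are an assumption) would be:
a <- 5   # a single number is just a vector of length 1
LogVector <- c(TRUE, FALSE, FALSE, TRUE)
IntVector <- c(1L, 2L, 3L, 4L)
DouVector <- c(1, 2, 3, 4)
ChaVector <- c("a", "b", "c", "d")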
LogVector
[1] TRUE FALSE FALSE TRUE
IntVector
[1] 1 2 3 4
13
DouVector
[1] 1 2 3 4
ChaVector
[1] "a" "b" "c" "d"
• If used within c() NA will always be coerced to the correct type of the vector.
• To create NAs of a specific type one can use NA_real_, NA_integer_ or NA_character_.
> 1 / 0
[1] Inf
> 0 / 0
[1] NaN
• is.character
• is.double
• is.integer
• is.logical
• is.atomic
typeof(IntVector)
[1] "integer"
typeof(DouVector)
[1] "double"
is.atomic(IntVector)
[1] TRUE
is.character(IntVector)
[1] FALSE
is.double(IntVector)
[1] FALSE
is.integer(IntVector)
[1] TRUE
is.logical(IntVector)
[1] FALSE
is.numeric(LogVector)
[1] FALSE
is.numeric(IntVector)
[1] TRUE
is.numeric(DouVector)
[1] TRUE
is.numeric(ChaVector)
[1] FALSE
R has 6 basic data types (the ones shown below + a raw data type used to hold raw bytes).
> z <- 1L
> typeof(z)
[1] "integer"
> k <- 2 + 4i
> typeof(k)
[1] "complex"
Usually, data vectors are not entered by hand in R, but read in as data saved in some other format.
However, often vectors with structure are needed and the following slides give some useful functions to create
such vectors.
Sequences
To create a vector that has a certain start and ending point and is filled with points that have equal steps
between them, the function seq can be used.
15
x <- seq(from = 0, to = 1, by = 0.2)
x
[1] 0.0 0.2 0.4 0.6 0.8 1.0
y <- seq(length = 6, from = 0, to = 1)
y
[1] 0.0 0.2 0.4 0.6 0.8 1.0
z <- 1:5
z
[1] 1 2 3 4 5
Replications
The function rep can be used to replicate objects in several ways. For details see the help of the function.
Here are some examples
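A minimal sketch of typical rep calls (illustrative, not necessarily the original examples):
rep(1:3, times = 2)
[1] 1 2 3 1 2 3
rep(1:3, each = 2)
[1] 1 1 2 2 3 3
rep(1:3, times = c(3, 1, 2))
[1] 1 1 1 2 3 3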
The sample function allows us to obtain a random sample of a specified size from the elements given in
a vector. The following code corresponds to the results of rolling a 6-sided die:
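The call (and random seed) that produced the result below is not shown; a call of the following form, assumed here, gives such output:
sample(1:6, size = 8, replace = TRUE)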
[1] 1 1 3 1 1 6 6 6
Logical operators in R
Logical vectors are usually created by using logical expressions. The logical vector is of the same length as
the original vector and gives elementwise the result for the evaluation of the expression.
The logical operators in R are:
Operator   Meaning
==         equal to (=)
!=         not equal to (≠)
<          less than (<)
>          greater than (>)
>=         greater than or equal to (≥)
<=         less than or equal to (≤)
Two logical expressions L1 and L2 can be combined using:
L1 & L2 for L1 and L2
L1 | L2 for L1 or L2
!L1 for the negation of L1
Logical vectors are typically created in the following way:
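A minimal sketch (illustrative values):
x <- c(1, 5, 3, 8)
x > 4
[1] FALSE  TRUE FALSE  TRUE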
When entering a logical vector, TRUE can be abbreviated with T and FALSE with F; this is however
not recommended.
Vector arithmetic
Here is a short example of vector arithmetic and the recycling of vectors:
x <- 1:4
x
[1] 1 2 3 4
y <- rep(c(1,2), c(2,4))
y
[1] 1 1 2 2 2 2
x ^ 2
[1] 1 4 9 16
x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 2 3 5 6 3 4
Taking substrings using substr (alternatively substring can be used, but it has slightly different arguments):
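The vector cols used below is not defined in this extract; an assumed definition together with a substr example:
cols <- c("red", "yellow", "blue")
substr(cols, 1, 2)
[1] "re" "ye" "bl"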
paste(cols, "flowers")
Coercion
• As all elements in an atomic vector must be of the same type it is of course of interest what happens
if they aren’t.
• In that case the different elements will be coerced to the most flexible type.
• The most flexible type is usually character. But for example a logical vector can be coerced to an
integer or double vector where TRUE becomes 1 and FALSE becomes 0.
• Coercion order: logical -> integer -> double -> (complex) -> character
• Coercion often happens automatically. Most mathematical functions try to coerce vectors to numeric
vectors. And on the other hand, logical operators try to coerce to a logical vector.
• In most cases if coercion does not work, a warning or error message is returned.
• In programming, to avoid coercion to a possibly wrong type, the coercion is forced explicitly using the "as"
functions like as.character, as.double, as.numeric, . . .
LogVector
[1] TRUE FALSE FALSE TRUE
sum(ChaVector)
Error in sum(ChaVector): invalid 'type' (character) of argument
as.numeric(LogVector)
[1] 1 0 0 1
ChaVector2 <- c("0", "1", "7")
as.integer(ChaVector2)
[1] 0 1 7
18
ChaVector3 <- c("0", "1", "7", "b")
as.integer(ChaVector3)
Warning: NAs introduced by coercion
[1] 0 1 7 NA
Lists
Lists are different from atomic vectors as their elements do not have to be of the same type.
To construct a list one usually uses list.
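The list List1 used below is not defined in this extract; a definition consistent with the printed results (length 4, with a NULL second component) would for instance be:
List1 <- list(1:3, NULL, "hello", c(TRUE, FALSE))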
length(List1)
[1] 4
[[2]]
NULL
Combining Lists
• Several lists can be combined into one list using c.
• If a combination of lists and atomic vectors is given to c then the function will first coerce each atomic
vector to lists before combining them.
19
List4 <- list(a = 1, b = 2)
Vec1 <- 3:4
Vec2 <- c(5.0, 6.0)
List5 <- c(List4, Vec1, Vec2)
List6 <- list(List4, Vec1, Vec2)
str(List5)
List of 6
$ a: num 1
$ b: num 2
$ : int 3
$ : int 4
$ : num 5
$ : num 6
str(List6)
List of 3
$ :List of 2
..$ a: num 1
..$ b: num 2
$ : int [1:2] 3 4
$ : num [1:2] 5 6
More on lists
Attributes
• All objects in R can have additional attributes to store metadata about the object. The number of
attributes is basically not limited, and the collection of attributes can be thought of as a named list with
unique component names.
• Individual attributes can be accessed using the function attr or all at once using the function
attributes.
Attributes examples
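The vector VecX used below is not created in this extract; a definition consistent with the printed attributes (the length of VecX is an assumption) would be:
VecX <- 1:10
attr(VecX, "attribute1") <- "I'm a vector"
attr(VecX, "attribute2") <- 3
attributes(VecX)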
$attribute1
[1] "I'm a vector"
$attribute2
[1] 3
typeof(attributes(VecX))
[1] "list"
Special attributes in R
In R three attributes play a special role; we will come back to them later in more detail and just mention
them briefly now:
• names: the names attribute is a character vector giving each element a name. This will be discussed
soon.
• dimension: the dim (dimension) attribute will turn vectors into matrices and arrays.
• class: the class attribute is very important in the context of S3 classes discussed later.
• Depending on the function used, attributes might or might not get lost.
• The three special attributes mentioned earlier have special roles and are usually not lost; many other
attributes, however, often do get lost.
attributes(5 * VecX - 7)
$attribute1
[1] "I'm a vector"
$attribute2
[1] 3
attributes(sum(VecX))
NULL
attributes(mean(VecX))
NULL
1. Directly at creation:
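A minimal sketch of naming at creation (illustrative values):
Nvec1 <- c(a = 1, b = 2, c = 3)
Nvec1
a b c
1 2 3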
Nvec2 <- 1:3
Nvec2
[1] 1 2 3
names(Nvec2) <- c("a", "b", "c")
Nvec2
a b c
1 2 3
Properties of names
names(c(a = 1, 2, 3))
[1] "a" "" ""
names(1:3)
NULL
Factors
Categorical data is an important data type in statistics - in R they are usually represented by factors.
A factor in R is basically an integer vector with two attributes:
1. The class attribute which has the value factor and which makes it behave differently compared to
standard integer values.
2. The levels attribute which specifies the set of admissible values the factor can take.
Factors demo
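The factor Fac1 used below is not created in this extract; a definition consistent with the printed class and levels (the exact values are an assumption) would be:
Fac1 <- factor(c("blue", "green", "blue", "blue"))
class(Fac1)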
[1] "factor"
levels(Fac1)
[1] "blue" "green"
Levels of a factor
Hence all possible values of a factor should be specified, even when not all of them appear in the observed
vector. This will also often be more informative when analyzing data.
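The vectors SexCha and SexFac used below are not created in this extract; definitions consistent with the tables (assumed here) would be:
SexCha <- rep("male", 3)
SexFac <- factor(SexCha, levels = c("male", "female"))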
table(SexCha)
SexCha
male
3
table(SexFac)
SexFac
male female
3 0
• In statistics often one group is used as a reference group and all other groups are compared to this
group.
• To achieve this in R the reference group should be the first level of a factor.
• To change the order of the levels, the function relevel should be used.
23
treat
[1] DRUG2 DRUG2 PLACEBO PLACEBO PLACEBO PLACEBO
Levels: DRUG2 PLACEBO
treat2 <- factor(rep(c(1, 3), c(2, 4)), levels = 1:3,
labels = c("DRUG2", "DRUG1", "PLACEBO"))
treat2
[1] DRUG2 DRUG2 PLACEBO PLACEBO PLACEBO PLACEBO
Levels: DRUG2 DRUG1 PLACEBO
treat3 <- relevel(treat2, ref = "PLACEBO")
treat3
[1] DRUG2 DRUG2 PLACEBO PLACEBO PLACEBO PLACEBO
Levels: PLACEBO DRUG2 DRUG1
Often one observes numeric values for a variable and wants to categorize it according to its value. This
can easily be done using the function cut.
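A minimal sketch of cut (illustrative values and break points):
age <- c(15, 23, 37, 61, 45)
cut(age, breaks = c(0, 18, 40, 65), labels = c("young", "middle", "old"))
[1] young  middle middle old    old
Levels: young middle old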
• Adding a dim attribute to an atomic vector allows it to behave like a multidimensional array.
• A special case of an array is a matrix - there the dimension attribute is of length 2.
• While matrices are an essential part of statistics, arrays are much rarer but are still useful.
• Usually matrices and arrays are not created by modifying atomic vectors but by using the functions
matrix and array.
24
A1 <- array(1:24, dim = c(3, 4, 2))
A1
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
• Naturally, also the “length” attribute of a matrix is then two-dimensional. The corresponding functions
are ncol and nrow.
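The matrix M1 used below is not created in this extract; a definition consistent with the printed values is:
M1 <- matrix(1:6, nrow = 2)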
ncol(M1)
[1] 3
nrow(M1)
[1] 2
colnames(M1) <- LETTERS[1:3]
rownames(M1) <- letters[1:2]
M1
A B C
a 1 3 5
b 2 4 6
rownames(M1)
[1] "a" "b"
length(M1) ## number of elements in matrix!
[1] 6
c(M1) ## columns are appended into an atomic vector
[1] 1 2 3 4 5 6
The counterpart of length for an array is dim and the counterpart of names is dimnames, which is a list of
character vectors of appropriate lengths.
dim(A1)
[1] 3 4 2
dimnames(A1) <- list(c("r1", "r2", "r3"), c("c1", "c2", "c3", "c4"),
c("a1", "a2"))
A1
, , a1
c1 c2 c3 c4
r1 1 4 7 10
r2 2 5 8 11
r3 3 6 9 12
, , a2
c1 c2 c3 c4
r1 13 16 19 22
r2 14 17 20 23
r3 15 18 21 24
• The extension of c for matrices is given by cbind and rbind. Similarly, the package abind provides the function
abind for arrays.
• For transposing a matrix in R the function t is available and for the array counterpart the function
aperm.
• To check if an object is a matrix / array the functions is.matrix / is.array can be used.
• Similarly coercion to matrices and arrays can be performed using as.matrix / as.array.
Data frames
The function data.frame can be used to create data frames. Since R 4.0.0 it does not by default convert
character vectors to factors anymore.
Most functions which read external data into R also return a data frame.
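The data frame DF1 used in the following examples is not created in this extract; a definition consistent with the later output (the column b is purely hypothetical) would be:
DF1 <- data.frame(a = 4:6, b = c(TRUE, FALSE, TRUE), c = c("o", "p", "q"))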
26
stringsAsFactors
Note that the argument stringsAsFactors = TRUE provides the old behaviour of automatic conversion.
options(stringsAsFactors = TRUE)
Basically a data frame is a list with an S3 class attribute. So “checks” of a data frame yield:
typeof(DF1)
[1] "list"
class(DF1)
[1] "data.frame"
is.data.frame(DF1)
[1] TRUE
Lists, vectors and matrices can be coerced to data frames if it is appropriate. For lists this means that all
objects have the same “length”.
V1 <- 1:5
L1 <- list(V1 = V1, V2 = letters[c(1, 2, 3, 2, 1)])
L2 <- list(V1 = V1, V2 = letters[c(1, 2, 3, 2, 1, 3)])
str(as.data.frame(V1))
'data.frame': 5 obs. of 1 variable:
$ V1: int 1 2 3 4 5
str(as.data.frame(M1))
'data.frame': 2 obs. of 3 variables:
$ A: int 1 2
$ B: int 3 4
$ C: int 5 6
str(as.data.frame(L1))
'data.frame': 5 obs. of 2 variables:
$ V1: int 1 2 3 4 5
$ V2: chr "a" "b" "c" "b" ...
str(as.data.frame(L2))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply dif
• The basic functions to combine two data frames (works similar with matrices) are cbind and rbind.
• When combining column-wise, then the numbers of rows must match and row names are ignored (hence
observations need to be in the same order).
• When combining row-wise the number of columns and their names must match.
• For more advanced combining see the function merge.
Note that cbind (and rbind) try to make matrices when possible. Only if at least one of the elements to be
combined is a data frame will the result also be a data frame.
Hence vectors usually can't be combined into a data frame using cbind.
V1 <- 1:3
V2 <- c("a", "b", "a")
str(cbind(V1, V2))
chr [1:3, 1:2] "1" "2" "3" "a" "b" "a"
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "V1" "V2"
28
Special columns in a data frame
• More common than adding a list is to add a matrix to a data frame - also here the protector
function I should be used.
# works:
DF4$b <- list(1:2,1:3,1:4)
DF4
a b
1 1 1, 2
2 2 1, 2, 3
3 3 1, 2, 3, 4
# does not work
DF5 <- data.frame(a = 1:3, b = list(1:2,1:3,1:4))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply dif
# does work
DF6 <- data.frame(a = 1:3, b = I(list(1:2,1:3,1:4)))
DF6
a b
1 1 1, 2
2 2 1, 2, 3
3 3 1, 2, 3, 4
Subsetting is a key feature when working with data. R is really flexible in this regard and has many different
ways to subset the different data structures.
In the following we will discuss the main ways for the main data structures.
29
Subsetting atomic vectors
We will start with subsetting atomic vectors, as subsetting the other structures is quite similar.
There are six ways to subset an atomic vector:
Specifying in square brackets the position of the elements which should be selected.
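The vector V1 used in these examples is not defined in this extract; a definition consistent with the printed values, together with an example of positive indexing:
V1 <- c(1, 3, 2.5, 7.2, -3.2)
V1[c(1, 3)]
[1] 1.0 2.5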
Specifying in square brackets the positions of the elements which should not be selected.
V1[-c(2, 4, 5)]
[1] 1.0 2.5
V1[c(-1, 2)]
Error in V1[c(-1, 2)]: only 0's may be mixed with negative subscripts
Giving in square brackets a logical vector of the same length means that the elements with value TRUE will
be selected.
# basic version
V1[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
[1] 1.0 2.5
# if the logical vector is too short,
# it will be recycled.
V1[c(TRUE, FALSE, TRUE)]
[1] 1.0 2.5 7.2
# most common is to use expression
# which return a logical vector
V1[V1 < 3]
[1] 1.0 2.5 -3.2
Giving in square brackets a character vector of the names which should be selected.
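A minimal sketch (the names are added here so that the later output with names a to e matches):
names(V1) <- c("a", "b", "c", "d", "e")
V1[c("a", "c")]
  a   c
1.0 2.5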
Blank indexing is not useful for atomic vectors but will be relevant for higher dimensional objects. It
returns in this case the original atomic vector.
V1[]
a b c d e
1.0 3.0 2.5 7.2 -3.2
Zero indexing returns in this case a zero length vector. It is often used when generating testing data.
V1[0]
named numeric(0)
Indexing lists
Lists are in general subset quite like atomic vectors. There are however more operators available for subset-
ting:
1. [ ([ ])
2. [[ ([[ ]])
3. $
The first one always returns a list, the other two options extract list components (details will follow later).
31
L1 <- list(a = 1:2, b = letters[1:3], c = c(TRUE, FALSE))
L1[1]
$a
[1] 1 2
L1[[1]]
[1] 1 2
L1$a
[1] 1 2
The most common way is to generalize the atomic vector subsetting to higher dimension by using one of the
six methods described earlier for each dimension.
Here then especially the blank indexing becomes relevant.
We will focus here on matrices, but arrays work basically the same.
M1[-2, ]
a b c
1 3 5
As matrices (arrays) are essentially vectors with a dimension attribute, a single vector can also be used to
extract elements. For this it is important that matrices (arrays) are filled in column-major order.
32
M2 <- outer(1:5, 1:5, paste, sep = ",")
M2
[,1] [,2] [,3] [,4] [,5]
[1,] "1,1" "1,2" "1,3" "1,4" "1,5"
[2,] "2,1" "2,2" "2,3" "2,4" "2,5"
[3,] "3,1" "3,2" "3,3" "3,4" "3,5"
[4,] "4,1" "4,2" "4,3" "4,4" "4,5"
[5,] "5,1" "5,2" "5,3" "5,4" "5,5"
M2[c(3, 17)]
[1] "3,1" "2,4"
This is rarely done but possible. To select elements from an n-dimensional object, a matrix with n columns
can be used. Each row of the matrix specifies one element. The result will always be a vector. The matrix
can consist of integers or of characters (if the array is named).
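A short sketch using the matrix M2 defined above (the selected elements are illustrative):
sel <- rbind(c(1, 2), c(3, 4))
M2[sel]
[1] "1,2" "3,4"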
Recall that data frames are on the one side lists and on the other side similar to matrices.
If a data frame is subset with a single vector it behaves like a list. If subset with two vectors it behaves like
a matrix.
To select columns:
# like a matrix
DF1[, c("a","c")]
a c
1 4 o
2 5 p
3 6 q
# like a list
DF1[c("a","c")]
a c
1 4 o
2 5 p
3 6 q
# like a matrix
DF1[, "a"]
[1] 4 5 6
# like a list
DF1[ "a"]
a
1 4
2 5
3 6
• In general S3 objects consist of atomic vectors, matrices, arrays, lists and so on. And they can be
extracted from the S3 object using the same ways as described above.
• Again, the initial step is to look at str to reveal the details of the object.
set.seed(1)
x <- runif(1:100)
y <- 3 + 0.5 * x + rnorm(100, sd = 0.1)
fit1 <- lm(y ~ x)
class(fit1)
[1] "lm"
str(fit1)
List of 12
$ coefficients : Named num [1:2] 2.982 0.531
..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
$ residuals : Named num [1:100] 0.0495 -0.0549 0.0342 -0.1234 0.1549 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ effects : Named num [1:100] -32.572 1.414 0.028 -0.137 0.157 ...
..- attr(*, "names")= chr [1:100] "(Intercept)" "x" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:100] 3.12 3.18 3.29 3.46 3.09 ...
..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:100, 1:2] -10 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:100] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "x"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.1 1.05
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 98
$ xlevels : Named list()
$ call : language lm(formula = y ~ x)
$ terms :Classes 'terms', 'formula' language y ~ x
.. ..- attr(*, "variables")= language list(y, x)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "y" "x"
.. .. .. ..$ : chr "x"
.. ..- attr(*, "term.labels")= chr "x"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(y, x)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "y" "x"
$ model :'data.frame': 100 obs. of 2 variables:
..$ y: num [1:100] 3.17 3.12 3.32 3.34 3.24 ...
..$ x: num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ x
.. .. ..- attr(*, "variables")= language list(y, x)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "y" "x"
.. .. .. .. ..$ : chr "x"
.. .. ..- attr(*, "term.labels")= chr "x"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(y, x)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
- attr(*, "class")= chr "lm"
# the intercept
fit1$coefficients[1]
(Intercept)
2.982
# the slope
fit1$coefficients[2]
x
0.5312
# sigma needs to be computed
sqrt(sum((fit1$residuals - mean(fit1$residuals))^2) / fit1$df.residual)
[1] 0.09411
summary(fit1)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.18498 -0.05622 -0.00871 0.05243 0.25166
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9821 0.0206 145 <2e-16 ***
x 0.5312 0.0353 15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These operators are much more restrictive than their standard counterparts.
36
More on standard subsetting operators
We have already used the operators [[ and $, which are frequently used when extracting parts from lists and other
objects.
• [[ is similar to [, but it can only extract a single value/component. Hence only a positive integer or a
string can be used in combination with [[.
• $ is a shorthand for [[ when the component is named.
These operators are mainly used in the context of lists and the difference is that [ always returns a list while
[[ gives the content of the list.
str(L1)
List of 3
$ a: int [1:2] 1 2
$ b: chr [1:3] "a" "b" "c"
$ c: logi [1:2] TRUE FALSE
L1[[1]]
[1] 1 2
L1[1]
$a
[1] 1 2
L1$a
[1] 1 2
str(L1[[1]])
int [1:2] 1 2
str(L1[1])
List of 1
$ a: int [1:2] 1 2
str(L1$a)
int [1:2] 1 2
If [[ is used with a vector of integers or characters then it is assuming nested list structures.
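A minimal sketch (illustrative list):
NL <- list(a = list(b = 1:3, c = "x"), d = 4)
NL[[c("a", "b")]]   # the same as NL[["a"]][["b"]]
[1] 1 2 3
NL[[c(1, 2)]]       # second element of the first component
[1] "x"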
37
Simplification vs preservation
As the different subsetting operators have different properties, whether they simplify or preserve needs to be
kept in mind at all times, as it can have a huge impact in programming.
When in doubt it is usually better not to simplify, so that an object always stays of the type it was
originally.
To prevent or force simplification, the argument drop can be specified in [.
             Simplification          Preservation
vector       x[[1]]                  x[1]
list         x[[1]]                  x[1]
factor       x[ind, drop = TRUE]     x[ind]
matrix       x[1, ] or x[, 1]        x[ind, , drop = FALSE] or x[, ind, drop = FALSE]
data frame   x[, 1] or x[[1]]        x[, 1, drop = FALSE] or x[1]

Here ind is an indexing vector of positive integers and naturally arrays behave the "same" as matrices.
Simplification for lists concerns if the result has to be a list or can be of the type of the extracted object.
38
F1 <- factor(c("a", "b", "a"),
levels = c("a","b","c"))
F1
[1] a b a
Levels: a b c
F1[1]
[1] a
Levels: a b c
F1[1, drop = TRUE]
[1] a
Levels: a
droplevels(F1)
[1] a b a
Levels: a b
Simplification for data frames means single columns are returned as vectors and not as data frames.
39
DF1 <- data.frame(a = 1:2, b = letters[1:2])
str(DF1[1])
'data.frame': 2 obs. of 1 variable:
$ a: int 1 2
str(DF1[[1]])
int [1:2] 1 2
str(DF1[ , "a", drop=FALSE])
'data.frame': 2 obs. of 1 variable:
$ a: int 1 2
str(DF1[ , "a"])
int [1:2] 1 2
More on $
More on $ II
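The content of these two slides is not reproduced in this extract. A typical point about $ (an assumption about what was covered) is that it does partial matching on names, while [[ matches exactly by default:
PL <- list(alpha = 1:3, beta = "b")
PL$al                     # partial matching: finds alpha
[1] 1 2 3
PL[["al"]]                # exact matching by default
NULL
PL[["al", exact = FALSE]]
[1] 1 2 3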
All subsetting operators can be combined with assigning values to the selected parts.
x <- 1:6
x
[1] 1 2 3 4 5 6
x[1] <- 20
x
[1] 20 2 3 4 5 6
x[-1] <- 51:55
x
[1] 20 51 52 53 54 55
x[c(1,1)] <- c(-10,-20)
x
[1] -20 51 52 53 54 55
## Logical & NA indexing can be combined!
## It is be recycled:
x[c(TRUE,FALSE,NA)] <- 1
x
[1] 1 51 52 1 54 55
Data storage
• While it is possible for a computer to store numbers exactly, it is more common to use approximate
representations.
• R uses double precision floating point numbers for its numeric computations.
• E.g., 123.45 is a decimal floating point number everyone understands to be the same as: 123.45 =
1 · 10^2 + 2 · 10^1 + 3 · 10^0 + 4 · 10^(-1) + 5 · 10^(-2).
• One can also write this as 123.45 = 12345 · 10^(-2) = 1.2345 · 10^2 (the last is the normalized form).
• The sequence of (here, decimal) digits 12345 is called the significand (or mantissa), the 2 is the exponent
(or characteristic) of the number.
• A floating point number system is characterized by four integers: b (base or radix), p (precision), and
emin and emax (minimal and maximal exponents).
• Clearly, all floating point numbers can be represented by the triple (sign, exponent, significand).
• IEEE 754 is a standard for base 2 which says: for double precision, use 64 bits (8 bytes) overall, split
as sign: 1 bit, exponent: 11 bits, significand: 52 bits.
• In principle, the exponent is represented using the biased scheme.
– Note: in this scheme, for k = 11 bits, the representation β_{k-1} β_{k-2} . . . β_0 corresponds to
e = (Σ_{i=0}^{k-1} β_i 2^i) − (2^10 − 1).
– So the exponent range would be −1023, −1022, . . . , 1023, 1024.
• But, the smallest (all zero’s) and largest (all one’s) exponents are special!
41
• Representing binary floating point numbers in IEEE 754 works as follows:
(a) Exponent neither all 0 bits nor all 1 bits: this is the normalized number
±(1 + δ_1/2 + δ_2/2^2 + . . . + δ_52/2^52) · 2^e
(b) Exponent all 0 bits: this is the de-normalized number
±(0 + δ_1/2 + δ_2/2^2 + . . . + δ_52/2^52) · 2^(-1022)
(c) Exponent all 1 bits: if all bits in the significand are 0, this is ±∞; otherwise, it is a NaN.
• Question: Which IEEE 754 floating point number does the bit pattern with sign σ, exponent 11 . . . 1 and
significand 0 . . . 0 correspond to?
• Answer: all exponent bits are 1 and all significand bits are zero, so (c) on the previous slide applies: the
number is ±∞.
• Question: Which IEEE 754 floating point number does the bit pattern with sign σ, exponent 00 . . . 0 and
significand 0 . . . 0 correspond to?
• Answer: all exponent bits are 0, so this is a denormalized number (see (b) on the previous slide) which has
all δ_i = 0 and
±(0 + 0/2 + . . . + 0/2^52) · 2^(-1022) = 0
• Note: This is how we get two zeros (because of the sign bit).
• Question: what is the smallest positive normalized number we can represent?
• Answer:
– the exponent should be as small as possible: 000. . . 001 (all zero's does not work, as the number
is normalized).
– the significand should be as small as possible: 000. . . 000.
±(1 + 0/2 + . . . + 0/2^52) · 2^(-1022) = 2^(-1022)
• The largest denormalized number, for comparison, is
(0 + 1/2 + . . . + 1/2^52) · 2^(-1022) = (Σ_{i=1}^{52} 2^(-i)) · 2^(-1022) = 2^(-1022) (1 − 2^(-52)).
42
Rounding effects
• The maximal precision we can expect for floating point computations in R is about 16 significant decimal
digits (52 binary digits).
• So the basic rule 1 + x = 1 ⇒ x = 0 does not hold in floating point arithmetic! (1 + 2^(-52) is the smallest
representable number greater than 1.)
x <- 2^(-52)
1 + x == 1
[1] FALSE
x <- 2^(-53)
1 + x == 1
[1] TRUE
• Consider 5/4 and 4/5. In decimal notation these can be exactly represented as 1.25 and 0.8.
• In binary notation 1.25 = 1.01 can be represented exactly, whereas 0.8 = 0.110011001100. . . is periodic and can therefore only be stored approximately:
n <- 1:10
1.25 * (n * 0.8) == n
[1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
• To avoid issues:
all.equal(1.25 * (n * 0.8), n)
[1] TRUE
• Rounding errors tend to accumulate so a long series of calculations will result in larger errors than a
shorter one.
[1] 6
var(x) # built in
[1] 11
sum((x - mean(x))^2)/10
[1] 11
[1] 11
[1] 11
sum((x - mean(x))^2)/10
[1] 11
[1] -13107
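The input code of the chunk above is not shown. A self-contained sketch of the same phenomenon (illustrative data, not the original) is the naive one-pass variance formula, which can suffer from massive cancellation:
x <- 1:11 + 10^9                   # values with a large common offset
var(x)                             # stable two-pass computation
[1] 11
sum((x - mean(x))^2) / 10          # the same, done by hand
[1] 11
(sum(x^2) - 11 * mean(x)^2) / 10   # naive one-pass formula: the result can be
                                   # completely wrong, even negative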
Integer storage
• So the 2^32 = 4294967296 bit sequences have one zero, one NA, and (2^32 − 2)/2 = 2^31 − 1 = 2147483647
positive and negative integers each.
• The smallest such integer is −(2^31 − 1), the largest is 2^31 − 1.
as.integer(2 ^ 31 - 1) # works
[1] 2147483647
as.integer(2 ^ 31) # too large: gives NA (with a coercion warning)
[1] NA
Flow control
Flow control
• Many problems are often of a repetitive nature and solutions are not obtained in a single instance but
certain steps need to be repeated.
• For example
• For this flow control R offers different constructs which we will introduce in the following slides.
for loop
• The for() statement in R specifies that certain statements are to be repeated a fixed number of times.
• The syntax looks like:
for (index in vector) {
statements
}
• This means that the variable index runs through all elements in vector. For each value then in vector
the statements are executed.
• If for each value a result is created which should be stored, then it is recommended to create first an
object of the appropriate length which is used to store the results.
Fibonacci numbers
To compute in R the first 10 Fibonacci numbers we can use a for loop in the following way:
Fib <- numeric(10) ## create a vector which will store numeric elements
Fib[1] <- 1
Fib[2] <- 1
for (i in 3:10) {
Fib[i] <- Fib[i-1] + Fib[i-2]
}
Fib
[1] 1 1 2 3 5 8 13 21 34 55
if statement
• The if statement can be used to control whether and when certain statements are to be executed.
• There are two versions:
if (condition) {
statements when condition is TRUE
}
or
if (condition){
statements when TRUE
} else {
statements when FALSE
}
if else example
x <- 3
if (x < 5) {
print("'x' is smaller than 5")
} else {
print("'x' is at least 5")
}
while loop
• The while loop can be used when statements have to be repeated but it is not known in advance exactly how
often. The computations should be continued as long as a condition is fulfilled.
• The syntax looks like:
while (condition) {
statements
}
• Hence here condition is evaluated and if FALSE nothing will be done. If the condition is however TRUE,
then the statements are executed. After the statements are executed, the condition is again evaluated.
46
Fibonacci numbers II
To compute for example all Fibonacci numbers smaller than 100 we could use
Fib1 <- 1
Fib2 <- 1
Fibs <- c(Fib1)
while (Fib2 < 100) {
Fibs <- c(Fibs, Fib2)
oldFib2 <- Fib2
Fib2 <- Fib1 + Fib2
Fib1 <- oldFib2
}
Fibs
[1] 1 1 2 3 5 8 13 21 34 55 89
Note: increasing the length of a vector can be costly for R! Avoid if possible.
repeat loop
• If a loop is needed which does not go through a prespecified number of iterations or should not have
a condition check at the top the repeat loop can be used.
• The syntax looks like:
repeat {
statements
}
• This causes the statement to be repeated endlessly. Therefore a terminator called break needs to be
included. It is usually included as:
if (condition) break
• In general the break command can be used in any loop and it causes the loop to terminate immediately.
• Similarly, the command next can also be used in any loop; it causes the computations of the
current iteration to be terminated immediately and the next iteration to be started from the top.
• The repeat loop and the functions break and next are rarely used since it is much easier to read and
understand programs using the other looping methods.
To compute for example all Fibonacci numbers smaller than 100 we could use also
Fib1r <- 1
Fib2r <- 1
Fibsr <- c(Fib1r)
repeat {
Fibsr <- c(Fibsr, Fib2r)
oldFib2r <- Fib2r
Fib2r <- Fib1r + Fib2r
Fib1r <- oldFib2r
if (Fib2r > 100) break
}
Fibsr
[1] 1 1 2 3 5 8 13 21 34 55 89
switch
• Another possibility for conditional execution is the function switch. It is especially useful when
there are more than two possibilities or if the options are named.
• The basic syntax is
switch(EXPR, options)
where EXPR can be an integer value which says which option should be chosen, alternatively it can be a
character string if the options are named.
switch examples I
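The calls producing the three results below are not included in this extract; calls of the following form (assumed here) yield exactly this pattern:
z <- 11
switch(1, z, z + 1)        # an integer selects the first option
switch(2, z, z + 1)        # ... or the second option
print(switch(3, z, z + 1)) # no third option: returns NULL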
[1] 11
[1] 12
NULL
switch examples II
SUM <- function(x, type = "L2") {
switch(type,
L2 = {LOC <- mean(x)
SCA <- sd(x)},
L1 = {LOC <- median(x)
SCA <- mad(x)}
)
return(data.frame(LOC = LOC, SCA = SCA))
}
set.seed(1); x <- rnorm(100)
SUM(x)
LOC SCA
1 0.1089 0.8982
SUM(x, type = "L1")
LOC SCA
1 0.1139 0.87
• A function not directly connected to the previous flow control but still useful is ifelse.
• The basic syntax is
ifelse(EXPR, yes, no)
• This function is usually used when EXPR is a vector. The result is a vector of same length as EXPR that
has as corresponding entry the value of yes if EXPR is TRUE, of no if EXPR is FALSE. Missing values in
EXPR remain missing values.
• Note that ifelse will try to coerce EXPR to logical if it is not. Also the attributes from EXPR will be
kept and only the entries replaced.
ifelse example
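The original example is not reproduced here; a minimal sketch (illustrative values):
x <- c(-2, 0, 3, NA, 5)
ifelse(x > 0, "positive", "not positive")
[1] "not positive" "not positive" "positive"     NA             "positive"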
49
R functions
Functions in R
• Functions are fundamental building blocks in R and are self contained units of code with a well-
defined purpose.
• To create a function function() is used. The parentheses enclose the arguments list. Then a single
statement or multiple statements enclosed by {} are specified.
• When R executes a function definition it produces an object with three parts: the formals (argument list), the body, and the environment.
When printing the function it will display these parts. (If the environment is not shown, it is the global
environment.)
To reduce the burden for the user, one can give default values to some arguments:
f <- function(x, y = 1) {
z <- x + y
2 * z
}
f
function(x, y = 1) {
z <- x + y
2 * z
}
formals(f)
$x
$y
[1] 1
50
body(f)
{
z <- x + y
2 * z
}
environment(f)
<environment: R_GlobalEnv>
Primitive functions
• There is one exception: a group of functions which do not have the three parts just described - these
are called primitive functions.
• All primitive functions are located in the base package. They call directly C code and do not contain
any R code.
sum
function (..., na.rm = FALSE) .Primitive("sum")
formals(sum)
NULL
body(sum)
NULL
environment(sum)
NULL
• To demonstrate how some operators are actually functions check the following code:
x <- 10
y <- 20
x + y
[1] 30
'+'(x, y)
[1] 30
Scope of variables
• In R scope is controlled by the environment of the functions.
f <- function(x, y = 1) {
z <- x + y
2 * z
}
z
[1] 2 2 2 1 1 1 1 1
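A self-contained sketch of the scoping rule (not the original example, whose full code is not shown): an assignment inside a function creates a local variable and does not change a global variable of the same name.
zz <- 1          # global variable
g <- function(x) {
  zz <- x + 100  # local zz, the global one is untouched
  zz
}
g(1)
[1] 101
zz
[1] 1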
Lazy evaluation
• In the standard case, R arguments are lazy - they are only evaluated when they are actually used.
• To force an evaluation you have to use the function force.
• This also allows us to specify default values in the header of the function for variables which are created
locally.
f1 <- function(x) 10
f2 <- function(x) {
force(x)
10
}
f1(stop("You made an error!"))
[1] 10
f2(stop("You made an error!"))
Error in force(x): You made an error!
Calling functions
52
Calling functions examples
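The definition of f used below is not included in this extract; a definition consistent with the printed structure would be:
f <- function(pos1, pos2, pos3) {
  list(pos1 = pos1, pos2 = pos2, pos3 = pos3)
}
The call f(1, 2, 3) matches the arguments by position; alternatively arguments can be matched by (partial) name, e.g. f(pos3 = 3, 1, 2).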
str(f(1, 2, 3))
List of 3
$ pos1: num 1
$ pos2: num 2
$ pos3: num 3
Functions returns
• Functions in general can return only one object. This is however not a real restriction, as
all the desired output can be collected into a list.
• The last expression evaluated in a function is by default the returned object.
• Whenever the function return(object) is called within a function, the function is terminated and
object is returned.
f1 <- function(x) {
if (x < 0) return("not positive")
if (x < 5) {
"between 0 and 5"
} else {
"larger than 5"
}
}
f1(-1)
[1] "not positive"
f1(1)
[1] "between 0 and 5"
f1(10)
[1] "larger than 5"
Invisible return
It is possible to return objects from a function call which are not printed by default using the invisible
function.
Invisible output can be assigned to an object and/or forced to be printed by putting the function call between
round parentheses.
f1 <- function() 1
f2 <- function() invisible(1)
f1()
[1] 1
f2()
• The magrittr package defines the pipe operator %>% and many other packages also make use of it.
• Rather than typing f(x, y) we type x %>% f(y) (start with x then use f(y) to modify it).
• R 4.1.x contains a base R pipe |> with the same syntax:
54
x <- 1:4; y <- 4
sum(x, y)
[1] 14
x |> sum(y)
[1] 14
x |> mean()
[1] 2.5
Basic statistics in R
• The following slides give some first vocabulary how to do basic statistics in R and how to formulate
statistical models in R.
• The usage of those functions will be demonstrated using the crabs dataset from the package MASS.
The crabs dataset of the package MASS contains 8 variables measured on 200 crabs. The variables are:
> library(MASS)
> data(crabs)
> # ?crabs would show the help file for the dataset
> str(crabs)
'data.frame': 200 obs. of 8 variables:
$ sp : Factor w/ 2 levels "B","O": 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
$ index: int 1 2 3 4 5 6 7 8 9 10 ...
$ FL : num 8.1 8.8 9.2 9.6 9.8 10.8 11.1 11.6 11.8 11.8 ...
$ RW : num 6.7 7.7 7.8 7.9 8 9 9.9 9.1 9.6 10.5 ...
$ CL : num 16.1 18.1 19 20.1 20.3 23 23.8 24.5 24.2 25.2 ...
$ CW : num 19 20.8 22.4 23.1 23 26.5 27.1 28.4 27.8 29.3 ...
$ BD : num 7 7.4 7.7 8.2 8.2 9.8 9.8 10.4 9.7 10.3 ...
• The classical summary statistics for numeric data are the mean, the standard deviation or variance,
correlation and covariance matrix. Other measures are the median and quantiles as well as the extreme
values.
• A good overview is provided in R using summary.
• The mean and median can be also obtained using functions of the same name.
• The functions for variance and standard deviation have also the obvious function names var and sd.
• Quantiles can be calculated using the quantile function.
> summary(crabs)
sp sex index FL RW CL
B:100 F:100 Min. : 1.0 Min. : 7.2 Min. : 6.5 Min. :14.7
O:100 M:100 1st Qu.:13.0 1st Qu.:12.9 1st Qu.:11.0 1st Qu.:27.3
Median :25.5 Median :15.6 Median :12.8 Median :32.1
Mean :25.5 Mean :15.6 Mean :12.7 Mean :32.1
3rd Qu.:38.0 3rd Qu.:18.1 3rd Qu.:14.3 3rd Qu.:37.2
Max. :50.0 Max. :23.1 Max. :20.2 Max. :47.6
CW BD
Min. :17.1 Min. : 6.1
1st Qu.:31.5 1st Qu.:11.4
Median :36.8 Median :13.9
Mean :36.4 Mean :14.0
3rd Qu.:42.0 3rd Qu.:16.6
Max. :54.6 Max. :21.6
56
> quantile(crabs$RW, probs = seq(0, 1, by = 0.2))
0% 20% 40% 60% 80% 100%
6.50 10.68 11.96 13.50 14.82 20.20
> table(crabs$sex)
F M
100 100
> tab <- table(crabs$sex, crabs$sp)
> tab
B O
F 50 50
M 50 50
> prop.table(tab) # total percentages
B O
F 0.25 0.25
M 0.25 0.25
> prop.table(tab, 1) # row percentages
B O
F 0.5 0.5
M 0.5 0.5
> prop.table(tab, 2) # column percentages
B O
F 0.5 0.5
M 0.5 0.5
The function by
• The function by is a very nice wrapper of the function tapply when using data frames. It can apply the
intended function to all variables of the data set for each unique level of an indicator variable.
• It is probably the easiest way to get a nice groupwise summary for a data frame. Note however that the function
must work on data frames!
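The call producing the excerpt below is not included in this extract; a call of the following form (the column selection is an assumption) computes group-wise column means, of which the output for the male group is shown:
> by(crabs[, 4:8], crabs$sex, colMeans)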
: M
FL RW CL CW BD
16.63 12.26 33.69 37.19 15.32
with(DATA, function(var.name,...))
For example:
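An illustrative call (not from the original slides) using the crabs data:
> with(crabs, tapply(RW, sp, mean))
This computes the mean rear width RW for each species without having to prefix the variable names with crabs$.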
Statistical models in R
Summary statistics give only a glimpse at the data, and often inference and/or modeling is the actual goal of the
analysis. R provides a lot of statistical tests as well as a lot of modeling functions. Before we can use them, however,
we have to learn something about R's formula definitions to be able to define models in R.
A basic formula in R has the form
y ~ x1 + x2 + x3
where the part left of ~ is the dependent variable and the right part defines the independent variables.
59
Formulae and intercept
The intercept in a model formula is represented by a 1. By default R assumes that an intercept is present, therefore
mentioning the intercept or not makes no difference. If however the intercept should be removed a -1 is needed in
the formula.
These two models are equivalent, both have an intercept:
y ~ x1 + x2 and y ~ x1 + x2 + 1
The same model without intercept must be defined as:
y ~ x1 + x2 - 1
• : is used for interactions like x1:x2.
• * gives main effects plus interactions, like x1 * x2 = x1 + x2 + x1:x2.
• ^ gives factor crossing up to a certain degree, like (x1 + x2 + x3)^2 = x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3.
• - removes terms, like (x1 + x2 + x3)^2 - x2:x3 = x1 + x2 + x3 + x1:x2 + x1:x3.
• y ~ I(x1 - 1) subtracts one unit from x1 before it enters the model and does not remove the intercept. This is therefore
different from y ~ x1 - 1.
• y ~ I(x1^2) squares the variable x1 and has nothing to do with factor crossing.
60
Different graphic systems in R
The different graphic systems in R are:
> demo("graphics")
> library(lattice)
> demo("lattice")
• plot(x,y)
produces a scatter plot if x and y are numeric.
• plot(X)
produces a scatter plot matrix if X is a data frame.
• plot(x) produces a scatter plot of x against its index vector if x is numeric.
• plot(x)
produces a bar plot if x is a factor.
• plot(x,y)
produces a spine plot if x and y are factors.
• pairs(X)
produces a scatter plot matrix if X is a matrix or data frame.
• coplot(x1 ~ x2 | x3)
produces a number of scatterplots of x1 against x2, given values of x3 (in case x3 is a factor it produces a
scatter plot for each factor level)
• matplot(X,Y)
plots the columns of the matrix X against the columns of matrix Y.
• image(x,y,z)
plots a grid of rectangles along the ascending x, y values and fills them with different colours to represent the
values of z.
• contour(x,y,z)
draws a contour plot for z.
• persp(x,y,z)
draws a 3D surface for z.
62
Special statistical high-level plotting functions II
• dotchart(x)
plots a dotchart.
• stripchart(x)
produces a 1D scatterplot.
• boxplot(x) produces a boxplot.
• pie(x)
produces a pie chart.
• curve(expr)
draws the given expression.
• add=TRUE
forces the function to act like a low-level plotting command, “adds” the plot to an already existing one.
• axes=FALSE
suppresses axis, useful when custom axes are added.
• log="x", "y" or "xy"
Logarithmic transformation of x, y or both axes.
• type=
controls the type of the plot, the default is points.
Types of plots
The default for the type= argument is "p" which draws an individual point for each observation. Other options for
this argument are:
63
Low-level plotting functions
Sometimes the results from the high-level plotting commands need to be "improved". This can be done by low-level
plotting commands which add additional information (like extra points, lines, a legend, . . . ) to an existing plot.
There are plenty of low-level plotting commands available.
In the following only a few of them will be introduced.
Note:
Polygons can be added with the function polygon.
Adding text
The function text adds text to a plot at specified coordinates. Usage:
text(x,y,labels,...)
This means that label_i is put at the position (x_i, y_i).
A common application for this is:
plot(x, y, type = "n")
text(x, y, names)
Note:
Also mathematical symbols and formulae can be added as text, then the labels are rather expressions. For details
see help for plotmath.
Adding a legend
The function legend adds a legend to a specified position in the plot.
Usage:
legend(x,y,legend,...)
In order to let R know what is the “connection” to the graph, at least one of the following options has to be specified.
The specification v must have the same length as legend.
64
Customizing axes
In R one can add several axes to a plot. The function to use is axis. You can specify for the axis the side, position,
label, tick and so on.
Usage:
axis(side,...)
The side of the plot is defined this way:
1=below, 2=left, 3=above and 4=right.
This function is mainly used when in the high-level plotting function the argument axes was set to FALSE.
Note:
If one wants ticks at an axis of a 1D plot for every observed value the function rug can be used.
Graphic parameters
Always when a graphic device gets activated, a list of graphical parameters is activated. This list has certain default
settings. Those default settings are often however not satisfying and should be changed.
Changes can be done permanently in order to affect all plotting functions submitted to that device or only for one
plotting function call. With graphical parameters one can change almost every aspect of a graphic. All graphic
parameters have a name. In the following, some of them are introduced.
> par()
gives a list with all graphical parameters and their current settings.
65
> plot(x,y, pch="*")
• lab=c(x,y,n)
x specifies the number of ticks at the x-axis, y at the y-axis, n the length of the tick labels in characters
(including decimal point).
• las=
orientation of axis labels (0=parallel, 1=horizontal, 2=perpendicular).
• mgp=c(d1,d2,d3)
positions of axis components (details see manual).
• tck=
length of the tick marks.
• xaxs=
style of the x-axis (possible settings, “s”, “e”, “i”, “r”, “d”) y-axis analogous.
Figure margins
A single plot in R is called a figure. A figure contains the actual "plotting area" as well as the surrounding margins.
The borderline between margins and plotting area is normally formed by the axes. The margins contain the labels, titles and
so on.
A graph of the plotting area can be seen on the next slide.
There are two arguments to control the margins. The argument mai sets the margins measured in inches, whereas
the argument mar measures them in numbers of text lines. The margins themselves are divided into four parts: the
bottom is part 1, the left part 2, the top part 3 and the right part 4. The different parts are addressed with the corresponding
index of the margin vector.
For instance:
mai=c(1,2,3,4) (1 inch bottom, 2 inches left, 3 inches top, 4 inches right)
mar=c(1,2,3,4) (1 line bottom, 2 lines left, 3 lines top, 4 lines right)
66
Figure regions
Device drivers
R can create graphics for almost all types of displays or printing devices. However, R has to be told before
the figure is made which device should be used - therefore the device driver has to be specified.
help(Devices)
provides a list with all possible devices. The device of interest is activated by calling its name and specifying
the necessary options in the parentheses.
For instance:
> jpeg(file="figure.jpg",
+ width=5, height=4, bg="white")
Figure 2: Taken from the R Introduction manual.
> dev.off()
Plotting example I
The following plot should give an impression of the colours, symbols and point sizes in R.
> plot(1,1,xlim=c(1,10),ylim=c(0,5),type="n")
> points(1:9,rep(4.5,9),cex=1:9,col=1:9,pch=0:8)
> text(1:9,rep(3.5,9),labels=paste(0:9),cex=1:9,col=1:9)
> points(1:9,rep(2,9),pch=9:17)
> text((1:9)+0.25,rep(2,9),paste(9:17))
> points(1:8,rep(1,8),pch=18:25)
> text((1:8)+0.25,rep(1,8),paste(18:25))
68
Plotting example I - the plot
(Figure: the resulting plot showing plotting symbols 0-25, colours and point sizes.)
Plotting example II
This plot is about putting two figures into one window.
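The code of this example is not included in this extract; a sketch that produces two such panels (the data and settings are assumptions consistent with the figure) is:
> x <- rnorm(80)
> par(mfrow = c(1, 2))
> hist(x, freq = FALSE)
> plot(density(x))
> par(mfrow = c(1, 1))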
69
Plotting example II - the plot
(Figure: two panels side by side, a histogram on the density scale and a kernel density estimate; N = 80, Bandwidth = 0.3046.)
70
Plotting example III - the plot
(Figure: Gas plotted against Insul with the levels Before and After.)
Workflow
Before we can perform the statistical analysis, steps are required to bring the data into a decent format and to get
it ready for the analysis:
Note: Steps 2-7 need not be done in order and can be done repeatedly.
Datasets available in R
Base R and a lot of add-on packages have built-in datasets (i.e., data.frame objects) to demonstrate the usage of
functions.
Those datasets can be loaded using the function
> data(foo)
71
This function searches along the search path for a dataset with the corresponding name.
A list of all datasets currently available can be retrieved by submitting only
> data()
• scan Most flexible function, all the following functions are based on this function.
• read.table The probably user friendliest function to read tabular data, this function will be used in this course.
• read.csv Same as read.table but different default values.
• read.csv2 Same as read.table but different default values.
• read.delim Same as read.table but different default values.
• read.delim2 Same as read.table but different default values.
Sometimes read.table unfortunately makes rather strange conversions for the different variables.
In that case the following arguments of read.table are useful:
• as.is Should the function really try to convert the variables to the “right” format?
• colClasses If you know the format of each class in advance, you can also specify them here.
72
data.table
Especially for large data sets, data frames are not very suitable. The package data.table provides its own
infrastructure to deal with data sets of sizes up to, for example, 100GB.
The corresponding function to read in the data is then fread.
In the following we will assume however a data frame.
Data preprocessing
• In the research process, doing the statistical analysis often takes less time than data preprocessing.
• Preprocessing in this context means for example transformations of variables, sorting according to variables,
combining different data frames or splitting data frames.
• R might not be the most convenient tool for data preprocessing (sorry!). But it still offers a lot of tools and
most operations can be made with it.
• This section of the lecture deals with data manipulation for data frames and uses methods that are provided by
the base distribution of R though for example packages like reshape help making some transformations easier.
Variable names
For large data sets it is sometimes useful to see the variable names of a data frame. Or sometimes one even wants to
change those names. There are several ways to do this. One way is the function names.
73
> dataF1 <- data.frame(V1 = 1:3, V2 = rnorm(3),
+ V3 = factor(c(1, 2, 1)))
> names(dataF1) # gets the names
[1] "V1" "V2" "V3"
> names(dataF1) <- c("v1","v2","v3") # overwrites the
> # current names
> row.names(dataF1)
[1] "1" "2" "3"
Note: rownames and colnames are for matrices. row.names and names are for data frames. But both versions can be
used.
74
The function merge II
• The function merge(x, y, ...) performs the operations known in database management systems (e.g., SQL)
as JOIN; a small sketch is given below.
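A small sketch (the data frames df1, df2 and the key column id are assumptions):
> df1 <- data.frame(id = 1:3, x = c("a", "b", "c"))
> df2 <- data.frame(id = 2:4, y = c(10, 20, 30))
> merge(df1, df2)                # inner join on the common column id
> merge(df1, df2, all.x = TRUE)  # left outer join
> merge(df1, df2, all = TRUE)    # full outer join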
75
The function reshape
• A special case of data is longitudinal data (panel data).
• There are two ways the data can be arranged: with the repeated measurements in individual columns (wide
format) or below each other (long format).
• Depending on the analysis you might need one or the other form. The function reshape can change between
them; a sketch is given below.
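A minimal sketch of going from the wide to the long format and back (the example data frame is an assumption):
> wide <- data.frame(id = 1:3, y1 = c(5, 6, 7), y2 = c(8, 9, 10))
> long <- reshape(wide, direction = "long", idvar = "id",
+                 varying = c("y1", "y2"), v.names = "y", timevar = "time")
> reshape(long, direction = "wide", idvar = "id",
+         v.names = "y", timevar = "time")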
76
The function split
If, for example, a separate data.frame is wanted for each level of a factor, the function split can be used. It saves
the result in a list, however, and the individual data frames must be extracted from there.
$b
V1 V2 V3
2 2 -0.2791 b
> library(MASS)
> data("crabs", package = "MASS")
> subset(crabs, RW >= 15.3 & sp == "B",
+ select = FL:BD)
FL RW CL CW BD
44 18.8 15.8 42.1 49.0 17.8
47 19.7 15.3 41.9 48.5 17.8
50 21.3 15.7 47.1 54.6 20.0
97 16.7 16.1 36.6 41.9 15.4
98 17.4 16.9 38.2 44.1 16.6
99 17.5 16.7 38.6 44.5 17.0
100 19.2 16.5 40.9 47.9 18.1
77
• The difference between the two is that the changes made with edit have to be stored in a new dataset, while fix
can overwrite the current dataset.
• Both functions will open a new window where you can edit single cells, change variable names or determine
the type of a variable.
Missing data
• In R, missing values are represented by the symbol NA (not available).
• Often the result of an operation in which NA occurs is also set to NA.
• Many functions and procedures have an argument for handling NAs (na.rm), which if it is set to TRUE excludes
the NA observations from the respective calculation.
• Note: This corresponds to the standard procedure of many statistics programs, but may lead to different
samples in the calculations.
• Most standard models cannot deal with missing values (exceptions: boosting, decision trees. . . ).
• In any case, missing values must be investigated before an analysis can be performed.
• Options:
Missing data in R
• In R, the function is.na() can be used on a vector, matrix or data frame to check which elements are NA:
> data("airquality")
> colMeans(is.na(airquality))
Ozone Solar.R Wind Temp Month Day
0.24183 0.04575 0.00000 0.00000 0.00000 0.00000
• Also, the function complete.cases() returns a logical which is TRUE if the row contains no NAs
78
> # number of complete observations/rows
> sum(complete.cases(airquality))
[1] 111
> ddf <- data.frame(x = c(1, NA, 3), y = c(11, 10, NA))
> ddf
x y
1 1 11
2 NA 10
3 3 NA
> ddf[is.na(ddf)] <- 0
> ddf
x y
1 1 11
2 0 10
3 3 0
> ddf <- data.frame(x = c(1, NA, 3), y = c(11, 10, NA))
> ddf$x[is.na(ddf$x)] <- mean(ddf$x, na.rm = TRUE)
> ddf$y[is.na(ddf$y)] <- mean(ddf$y, na.rm = TRUE)
> ddf
x y
1 1 11.0
2 2 10.0
3 3 10.5
79
Question: Would you use the mean or the median for imputation for the airquality data? How could you decide?
– univariate: observations that lie more than 1.5·IQR below the 25th or above the 75th percentile (the IQR,
"inter-quartile range", is the difference between the 75th and 25th percentiles); in a boxplot these can be
visualized as the points outside the whiskers.
– multivariate
∗ defined within the scope of a model (e.g., based on Cook’s distance, which we will encounter in the
regression chapter).
∗ observations which are anomalous based on all the variables under investigation (detected using
unsupervised learning algorithms for anomaly detection)
Outlier handling
• Elimination (not advised!)
• Imputation - same as missing values
• Capping - e.g., setting all values above (below) a certain quantile to the value of a quantile.
• Use methods in the statistical analysis which are robust to this issue.
[Plot: boxplot of airquality$Ozone; observations 62 and 117 lie above the upper whisker]
80
> bxp <- boxplot(airquality$Ozone)     # bxp$out contains the outlying values
> airquality$Ozone[which(airquality$Ozone %in% bxp$out)] <-
+ quantile(airquality$Ozone, 0.95, na.rm = TRUE)
Further R topics
> search()
[1] ".GlobalEnv" "whiteside" "package:MASS"
[4] "package:stats" "package:graphics" "package:grDevices"
[7] "package:utils" "package:datasets" "package:methods"
[10] "Autoloads" "package:base"
Packages for R
• R is open source software and users submit new functions all the time. These functions are normally submitted
as packages.
• The base version of R, however, comes only with a few selected packages. Other packages must first be installed;
the easiest way is to use the menu for this (see also the example below).
• Even though packages are installed, they are not yet available at the beginning of an R session (besides a few
basic packages which are loaded automatically). Add-on packages should be loaded by the user when they are
needed.
• Whether a package is loaded can be seen in the search path.
• Packages can be loaded using the menu or as
> library(foo)
• Sometimes it is also necessary to remove packages from the search path. This can be done by submitting
> detach("package:foo")
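Installation and updating can also be done from the console, for example (foo is a placeholder for a package name):
> install.packages("foo")   # install foo and its dependencies from CRAN
> update.packages()         # update all installed packages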
81
Citing R
• R comes for free and a lot of people contribute to it. They don’t want any money from you when you use it;
they would, however, like to be acknowledged when you use their work.
• Therefore it is appreciated if you cite R and specific packages when you use them for your work. If you want
to know how R or a package should be cited, use the function citation.
• For R in general:
> citation()
@Manual{,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2021},
url = {https://www.R-project.org/},
}
• For R packages:
> citation("MASS") # for citing packages, in this case the package MASS
@Book{,
title = {Modern Applied Statistics with S},
author = {W. N. Venables and B. D. Ripley},
publisher = {Springer},
edition = {Fourth},
address = {New York},
year = {2002},
note = {ISBN 0-387-95457-0},
url = {https://www.stats.ox.ac.uk/pub/MASS4/},
}
82
• Unless otherwise specified in the global settings, R will ask before it is closed whether the current workspace should
be saved. In that case it will load the saved workspace at the start of the next session.
• Saving the whole workspace is typically not recommended. See e.g., the discussion here.
• Objects saved in previous sessions can be loaded into a new session using load; see the example below.
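For example (using the data frame dataF1 created earlier; the file name is an assumption):
> save(dataF1, file = "dataF1.RData")  # save selected objects to a file
> load("dataF1.RData")                 # restore them in a later session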
Working directory
• The working directory is the path where R will search by default for files to read or where R will by default
save files.
• The current working directory can be obtained or changed using functions getwd() and setwd() or the menu.
> getwd()
[1] "/Users/lauravanagur/Documents/Teaching/CompStat"
> # try, but does not work in Rmarkdown
> # setwd("/Users/lauravanagur/Documents/")
> # getwd()
• When opening a file with RStudio, it automatically sets the working directory to the location of the file. In
Rmarkdown it is automatically the location of the .Rmd.
> ## getwd()
> ## "/Users/lauravanagur/Documents/Teaching/CompStat/Slides"
> dat <- read.csv("Practicals/Datasets/dat.csv")
Scripts
• Scripts written in editors are usually saved in files with ending .r or .R.
• These files can be loaded from within R.
• To load a whole script the function source is used.
83
> ## again, with relative path...
> source("Rscript.R")
• The source command will by default create all objects that are defined in the file but produces no output.
Output will only be produced if an object in the file is explicitly printed using the print function.
– Standard calendar is complicated (leap years, months of different lengths, historically different calendars
- Julian vs. Gregorian).
– Times depend on an unstated time zone (add daylight savings :-() and some years have leap seconds to
keep the clocks consistent with the rotation of the earth!
• R can flexibly handle dates and times and has different classes for them with different complexity levels.
• Most classes offer then also arithmetic functions and other tools to work with date and time objects.
• A good overview over the different classes is given in the Helpdesk section of the R News 4(1).
• The builtin as.Date() function handles dates (without times).
• The contributed library chron handles dates and times, but does not control for time zones.
• The POSIXct and POSIXlt classes allow for dates and times with control for time zones.
• The various as. functions can be used for converting strings or among the different date types when necessary.
> Sys.Date()
[1] "2022-01-20"
as.Date() function
• The as.Date() function allows for a variety of formats through the format= argument.
84
Code Value
%d Day of the month (decimal number)
%m Month (decimal number)
%b Month (abbreviated)
%B Month (full name)
%y Year (2 digit)
%Y Year (4 digit)
%C Century
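For example:
> as.Date("20.01.2022", format = "%d.%m.%Y")
[1] "2022-01-20"
> as.Date("01/20/22", format = "%m/%d/%y")
[1] "2022-01-20"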
85
• DDMMYY gives the date of birth.
• C specifies the century of birth. + = 19th Cent., - = 20th Cent. and A = 21st Cent.
• ZZZ is the personal identification number. It is even for females and odd for males
• Q is a control number or letter to see if the total number is correct.
Debugging
There are two commonly cited claims:
1. Programmers spend more time debugging their own code than actually programming it.
2. In every 20 lines of code there is at least one bug.
Hence debugging is an essential part of programming, and R offers strategies and tools to do this well.
Top-down programming
• The general agreement is that good code is written in a modular manner. This means that when you have a
procedure to implement, you decompose it into small parts, where each part becomes its own function.
• Then the main function is “short” and consists mainly of calls to these subfunctions.
• Naturally the same approach is also taken within these subfunctions.
• The same approach is followed in debugging. First the top-level function is debugged and all subfunctions are
assumed correct. If this does not yield a solution, then the next level is debugged, and so on.
86
Small start strategy
• The small start strategy in debugging suggests starting the debugging with small test cases.
• Once these work fine, consider larger test cases.
• At that stage also extreme cases should be tested.
Antibugging
• Some antibugging strategies are also useful in this context.
• Assume that at line n in your code you know that a variable or vector x must have some specific property, like
being positive or summing up to 1.
• Then, for debugging purposes, you can add in that line of the code for example
> stopifnot(all(x > 0))
or
> stopifnot(sum(x) == 1)
• browser
• debug and undebug
• debugger
• dump.frames
• recover
• trace and untrace
For details about these functions see their help pages. In the following we will look only at debug and traceback.
Note that RStudio also offers special debugging tools; see
https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio for details.
traceback
• Often when using functions and an error occurs, it is not really clear where the error actually occurs, i.e., which
(sub)function caused the error.
• One strategy is then to use the traceback function which, when called directly after the erroneous call, returns
the sequence of function calls which led to the error.
traceback II
87
> f1 <- function(x) f2(x)^2
> f2 <- function(x) log(x) + "x"
> mainf <- function(x) {
+ x <- f1(x)
+ y <- mean(x)
+ y
+ }
> mainf(1:3)
> traceback()
debug
Assume you have a function foo you assume faulty. Using then
> debug(foo)
will open the “browser” whenever the function is called, until either the function is changed or the debugging mode
is terminated using
> undebug(foo)
In the “browser” the function is executed line by line, and the next line to be executed is always shown.
• n (or just hitting enter) will execute the line shown and then present the next line to be executed.
• c this is almost like n, except that it may execute several lines of code at once. For example, if you are in a loop,
then c will jump to the next iteration of the loop.
• where this prints a stack trace, i.e., the sequence of function calls which led the execution to the current location.
• Q this quits the browser.
In browser mode any other R command can be used as well. However, to see for example the value of a variable
n, the variable needs to be explicitly printed using print(n).
Debugging demo
In a demo we will go through the following function in debugging mode
88
+ }
+ print(paste(i,Sys.time()))
+ }
+ return(RES)
+ }
> debug(SimuMeans)
> SimuMeans(5)
Capturing errors
• Especially in simulations it is often desired that, when an error occurs, not the whole process is terminated;
instead the error is caught and an appropriate record made, but otherwise the simulation continues.
• R has for this purpose the functions try and tryCatch; here we will consider only tryCatch.
• The idea of tryCatch is to run the “risky” part, where errors might occur, within the tryCatch call and tell
tryCatch what to return in the case of an error.
89
+ }
> SimuMeans3(5)
[,1] [,2] [,3]
[1,] 0.10889 -0.29099 1.1103
[2,] -0.04921 -0.17200 0.8624
[3,] NA -0.02305 1.0302
[4,] -0.09209 -0.27303 1.0814
[5,] -0.05374 0.13526 1.0200
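The definition of SimuMeans3 is not fully shown above; purely as a hypothetical sketch (all names, dimensions and the artificial error are assumptions), a simulation function using tryCatch could look like this:
> SimuMeans3 <- function(n) {
+   RES <- matrix(NA, nrow = n, ncol = 3)
+   for (i in 1:n) {
+     RES[i, ] <- tryCatch({
+       x <- rnorm(100)
+       if (runif(1) < 0.1) stop("simulated failure")   # occasional error
+       c(mean(x), median(x), var(x))
+     },
+     error = function(e) rep(NA, 3))    # record NAs instead of stopping
+   }
+   RES
+ }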
Profiling
• If you know that your function is correct but think it is slow, you can do profiling, which helps to identify the
parts of the function which are bottlenecks; you can then consider whether these parts could be improved.
• The idea in profiling is that the software checks at very short intervals which function is currently running.
• The main functions in R to do profiling are Rprof and summaryRprof. But there are also many other specialized
packages for this purpose.
A function to profile
A function to profile II
Run it on your own computer and look at the full output of summaryRprof().
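The function that was profiled is not shown here; a generic sketch of the profiling workflow (the profiled code is only an example) is:
> Rprof("profile.out")            # start profiling, write samples to a file
> for (i in 1:50) m <- apply(matrix(rnorm(1e4), ncol = 10), 2, mean)
> Rprof(NULL)                     # stop profiling
> summaryRprof("profile.out")     # summarize where the time was spent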
Package microbenchmark
• The contributed package microbenchmark is useful in comparing the speed of different functions.
90
• The microbenchmark() function serves as a more accurate replacement of the often seen system.time().
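For example, comparing a specialized function with apply (a sketch):
> library(microbenchmark)
> X <- matrix(rnorm(1e5), ncol = 10)
> microbenchmark(colMeans(X), apply(X, 2, mean))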
Regression modeling in R
The following chapter gives a small glimpse of the linear regression model in R.
There are many options (functions) in R available for other regression models (e.g., generalized linear models, pe-
nalized regression models etc.). We focus here only on basic linear regression, but many principles also apply when
using functions for other regression models.
Here are some useful functions and packages for regression in R:
aov ANOVA models in R
lm linear regression
glm generalized linear models like logistic regression
nls nonlinear regression
nlme package for linear and nonlinear mixed effect models
lme4 package for linear and generalized linear mixed effect models
survival package for parametric and nonparametric survival models
The object returned by a regression function is usually quite complex and printing it returns only minimal output.
A lot of generic functions, however, have methods for the different regression models. Some important ones are:
91
Linear model
The linear model assumes that the relationship between the response variable (aka dependent variable, output) Y
and p independent variables (aka explanatory variables, predictors, covariates, features) X1 , . . . , Xp is linear and can
be represented as:
Y = β0 + β1 X1 + . . . + βp Xp + ϵ,
where β0 is the model constant or intercept, βj is the regression coefficient corresponding to the variable Xj and ϵ is
a random error term which captures variation in Y not explained by X1 , . . . , Xp .
The model is linear in the unknown parameters βj , j = 0, . . . , p.
The variables X1 , . . . , Xp can come from different sources:
• quantitative inputs,
• transformations of quantitative inputs such as the log, square root, square,
• basis expansions e.g., X2 = X12 , X3 = X13 . . .
• numeric or dummy coding of the levels of qualitative inputs,
• interactions between variables: X3 = X1 · X2 .
In matrix notation the model is y = Xβ + ϵ, where
• y = (y1 , . . . , yn ),
• X = (1, x1 , . . . , xp ) is an n × (p + 1) matrix of independent variables (including a vector of ones corresponding
to the intercept),
• β = (β0 , β1 , . . . , βp ) is the (p + 1) × 1 vector of regression coefficients (with intercept) and
• ϵ = (ϵ1 , . . . , ϵn ).
92
Conventions
Usually the following terms are used in a regression context:
Leverage
The leverages $h_i$ are useful in identifying influential observations. We know that (for a model with intercept):
• $\sum_{i=1}^{n} h_i = \mathrm{ncol}(X)$, where for a model with intercept $\mathrm{ncol}(X) = p + 1$,
• in a model with intercept, $h_i \geq \frac{1}{n}$.
$\mathrm{var}(\hat\beta) = (X^\top X)^{-1}\sigma^2$.
$\hat\beta \sim N\big(\beta,\, (X^\top X)^{-1}\sigma^2\big)$
• Also, $(n - p - 1)\,\hat\sigma^2 \sim \sigma^2 \chi^2_{n-p-1}$
• $t = \frac{\hat\beta_j}{\hat\sigma \sqrt{v_j}} \sim t_{n-p-1}$, where $v_j$ is the $j$th diagonal element of $(X^\top X)^{-1}$
• To check the significance of groups of coefficients simultaneously we can use the F -statistic which has an F
distribution under the null:
$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(n - p_1 - 1)} \sim F_{p_1 - p_0,\; n - p_1 - 1}$
where RSS0 is the RSS of the smaller model with p0 variables and RSS1 is the RSS of the larger model with
p1 variables.
93
Goodness of fit
• Before discussing the residual analysis, we recall a few quantities which quantify the extent to which the model
fits the data.
– residual standard error σ̂, standard deviation of the residuals which is an estimate of standard deviation
of ϵ.
∗ Roughly speaking, it is the average amount that the response will deviate from the true regression
line.
∗ It is measured in the units of the response.
– R2 statistic (coefficient of determination), which represents the proportion of variability in Y that can be
explained by the linear regression. Note that it always increases as we add more predictors to the model.
$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar y)^2$
• Note that these measures only apply to the linear regression case and do not easily extend to other types of
regression.
Residuals
The realizations of the random term $\epsilon_i$ are not observable. Therefore we use the residuals $r_i$ as estimates instead.
Residuals are useful to evaluate the goodness of fit of the model and to check the model assumptions, but they have
some design limitations since they must fulfill
$\sum_{i=1}^{n} r_i = 0 \quad \text{and} \quad X^\top r = 0$
Furthermore, residuals do not have the same variance by construction. Their variance decreases as the x values move
further away from the average x value:
var(ri ) = σ 2 (1 − hi )
There are two ways to standardize residuals to make them more useful for model diagnostics.
Standardized residuals
The standardized residuals are rescaled to have equal variances. They are computed using the leverages.
$\tilde r_i = \frac{r_i}{\hat\sigma \sqrt{1 - h_i}}$
Studentized Residuals
A way to get “good” residuals when there is one bad data point is to look at what would happen if we dropped one
observation and used only the remaining n − 1 observations for the estimation. With that fit we predict the value of
the omitted observation and obtain the so-called studentized residuals:
$\check r_i = \frac{y_i - \hat y_{(i)}}{\sqrt{\mathrm{var}(y_i - \hat y_{(i)})}},$
94
where yi is the omitted observation and ŷ(i) the prediction of yi based on a model that was fitted after excluding the
ith observation.
Note: The terminology for residuals is not everywhere the same, therefore check always carefully which definition
your software package uses.
Residual analysis
The following plots can be useful for evaluating the model assumptions:
Note that model assumptions are usually checked visually rather than by testing.
qqplot of residuals
• The qqplot allows us to check the assumption of normality. It is recommended to use the standardized or the
studentized residuals for this purpose.
• The points should then lie on the bisector.
95
Outliers, leverage and influential points
[Plot: four scatterplots of y vs. x — original data; outlier with high leverage, low residual and no influence; point with high leverage, large residual and high influence; outlier with low leverage, large residual and low influence]
[Plot: scatterplots of the four Anscombe data sets, y1 vs. x1 through y4 vs. x4]
96
> data("anscombe")
> colMeans(anscombe)
x1 x2 x3 x4 y1 y2 y3 y4
9.000 9.000 9.000 9.000 7.501 7.501 7.500 7.501
> apply(anscombe, 2, sd)
x1 x2 x3 x4 y1 y2 y3 y4
3.317 3.317 3.317 3.317 2.032 2.032 2.030 2.031
[Plot: residuals vs. fitted values for the four Anscombe regressions (Anscombe 1-4)]
97
Anscombe quartet: qqplots of standardized residuals
[Plot: normal qqplots of the standardized residuals for the four Anscombe regressions]
Design matrix
• As shown earlier, we assume we have a data matrix X which contains the explanatory variables.
• However the data matrix containing the variables X1 , . . . , Xp is usually not the matrix which we use in the
formulas earlier, but here X denotes the model or design matrix based upon the explanatory variables.
• For example the model matrix has usually a column of 1’s to model an intercept term.
• In the following slides we will discuss the forms explanatory variables can take when entering the model matrix.
It is important, however, that the design matrix always has full rank.
• Example: when centering the predictors, the intercept can be interpreted as the expected value of Y for average
values of the original predictors. This can be useful in some applications, such as predicting house prices using
m² and the number of bedrooms.
98
Continuous variables and collinearity
• A continuous variable need not necessarily enter the model linearly. We can use transformations or add it
as a polynomial of higher order to the model.
• Adding polynomials should however be done with care because the polynomial terms correlate with each other
and can cause problems when estimating the parameters (collinearity → X⊤X gets close to being singular).
• If the predictors show large amounts of correlation, either pairwise elimination can be employed or a principal
component analysis could be made and the principal components used instead of the actual variables.
• Ideally, for the ceteris paribus interpretation to hold, the predictors should be independent. This is rarely the
case in practice. If the predictors are independent, then the coefficients of the individual simple linear regressions
are the same as the ones from the multiple linear regression.
– treatment contrast
– sum contrasts
– helmert contrast
– polynomial contrast
Treatment contrast I
• The treatment contrast is one of the most frequently used contrasts. The contrast has L − 1 columns.
• Assume we would have a factor with L = 4 levels, then the three columns would look like shown in the table.
• Assume x is the original categorical variable in the design matrix, this implies that we create l = 3 columns:
$d_{ij} = \begin{cases} 1 & \text{if } x_i = j + 1 \\ 0 & \text{otherwise} \end{cases}, \qquad j = 1, \ldots, L - 1$
Treatment contrast II
• The regression model would then be $y_i = \beta_0 + \beta_1 d_{i1} + \beta_2 d_{i2} + \beta_3 d_{i3} + \ldots + \epsilon_i$
• The interpretation for the coefficients of the dummies of levels 2-4 would then be the difference in the expected
response with respect to level 1 (assuming all other variables are 0).
• β0 + βj gives the expected response for group j.
• The effect of the first level could then be associated with the intercept β0 .
99
Sum contrast I
• The sum contrast is a popular contrast for balanced experimental designs.
• All columns in the contrast have to add up to 0.
Level [,1] [,2] [,3]
1 1 0 0
2 0 1 0
3 0 0 1
4 −1 −1 −1
• Assume x is the original categorical variable in the design matrix, this implies that we create l = 3 columns:
$d_{ij} = \begin{cases} 1 & \text{if } x_i = j \\ -1 & \text{if } x_i = L \\ 0 & \text{otherwise} \end{cases}, \qquad j = 1, \ldots, L - 1$
Sum contrast II
• The regression model would be
$y_i = \beta_0 + \beta_1 d_{i1} + \beta_2 d_{i2} + \beta_3 d_{i3} + \ldots + \epsilon_i$
• The interpretation for the coefficients of the dummies would then be the difference in the expected response
for level or group j with respect to the overall mean (assuming all other variables are 0).
• The intercept β0 has the interpretation of the overall expected value of the response when the predictors are
set to zero.
• $\beta_0 + \beta_j$ gives the expected response for group $j$ for $j = 1, \ldots, L - 1$, and for the $L$th group the expected response is
$\beta_0 - \sum_{j=1}^{L-1} \beta_j$.
Helmert contrast
• The helmert contrast is a popular contrast (for instance default in S-Plus).
• The first coefficient is the mean of the first two effects minus the first effect; the second coefficient is the mean
of all three effects minus the mean of the first two levels (parameter j compares the mean of the effects for levels
1:(j + 1) with the mean of the effects for the preceding levels 1:j).
• It turns out the intercept is the mean of the means.
Polynomial contrast
• The polynomial contrast is recommended for ordered equidistant factors.
• It envisages the levels of the factor as corresponding to equally spaced values of an underlying continuous
covariate.
• It forces the effects to be monotonic in factor level order.
• It is however not that easy to interpret
100
Interactions
• A basic model assumption is that the different variables have an additive effect on the response.
• However, this is not always the case and one way to include non-additive effects in linear models is by using
interactions.
• Usually only interactions between two variables at a time are considered. The interaction terms enter the
design matrix as products of the columns of the two variables concerned.
• The interpretation of the interactions depends on the variable types of the variables involved.
Interpreting interactions
Interactions between 2 factors: This is the simplest case. Here one has basically different levels for all possible
combinations of the levels of the original factors.
Interactions between factor and numeric variable: In this case the numeric variable has still a linear effect
but now for each factor level there is a different slope.
Interactions between 2 numeric variables: This is a bit difficult to interpret. Basically, if one variable is kept
fixed, then the effect of the other variable is linear, where the slope depends on the value at which the first variable
is kept fixed.
Model selection
When fitting a regression model the aim is normally to find the smallest set of predictors which still describe the
data adequately well. Several strategies are available for model selection. Often different methods lead to different
models!
But what should always be considered:
There are no routine statistical questions, only questionable statistical routines. (D.R. Cox)
Backward selection
This method is rather simple and starts with all predictors in the model. Then we choose a “p-to-remove” level α.
Here α does not necessarily have to be 0.05; often a larger α like 0.1 or 0.15 is chosen.
The method works the following way:
The final model has all predictors with a p-value smaller than α.
Forward selection
The forward selection method is just the opposite of the backward selection. It starts with an empty model and adds
predictors to the model as long as one of the remaining predictors has a p-value smaller than the “p-to-add” level α.
Again α is rather 0.1 or 0.15 than 0.05.
The method works the following way:
The method works the following way:
101
• fit all models with a single predictor and choose the one where the predictor has the smallest p-value, provided
it is smaller than α;
• fit all the models with the chosen predictor and one of the remaining predictors; keep again the one which has
the smallest p-value smaller than α;
• continue until no predictor can be added anymore.
Stepwise selection
• The stepwise selection is a combination of the backward and the forward selection methods.
• It starts with the backward selection. But after each deletion of a predictor from the model we check, using
the forward method, whether one of the previously deleted predictors could be added to the model again (the
one deleted in the last step cannot be added back immediately).
• After adding or not adding one, we continue with the backward selection until no variable can be added or
removed anymore.
The selection methods described above are easy to implement but have some drawbacks:
• because of the one-at-a-time scheme, the optimal model can be missed;
• there is a multiple comparison problem; especially when prediction is of interest, the stepwise procedure tends
to choose models that are too “small”;
• one should still think about whether one of the excluded variables has a causal relationship with the response
and should therefore remain in the model.
• In general one can say that models with more parameters will fit the data better.
• Therefore criteria are available which “punish” the number of predictors added. Let’s assume we have p predic-
tors in the model.
Deviance = −2 · log-likelihood
AIC = Deviance + 2p
(correspondingly, BIC = Deviance + log(n) · p)
• When we now compare models, we prefer models with a smaller information criterion (i.e., a higher log-likelihood
for the same number of parameters).
• These information criteria can also be used to substitute the p-values in the model selection methods. This
avoids for instance the multiple comparison problem.
• These criteria can also be compared when different distributional assumptions are made as long as they are
based on the same number of observations.
• Note: In the linear regression case, one can also use the adjusted coefficient of determination $R^2_A$ to compare
models:
$R^2_A = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}.$
102
Linear regression in R
The lm function
The function lm is the function for the basic linear model. Its usage is
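A shortened form of the call (see ?lm for the full argument list):
lm(formula, data, subset, weights, na.action, ...)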
If we have assigned an lm function call to an object, we can directly extract many results from it using indexing,
e.g., coefficients, residuals, fitted.values, rank, weights, df.residual, call, terms, contrasts, xlevels, y.
But often the same with more options can be obtained using generic functions.
lm regression objects
• Assume we fitted with an appropriate model formula a regression model using the function lm and assigned
that to the object lm.out.
• Then a lot of functions have a generic output when applied to this object. What exactly these functions are
doing can be explored using the help pages.
• If we are for example interested to know what summary does to an lm object, we can ask the help for this by
using ?summary.lm.
• In general, for any generic function the specific help can be obtained this way.
• If we just ask for the lm.out object we get only minimal output. That is the model formula and the estimated
parameters.
Function update
• After creating a regression object one often wants to make only a small change, like changing the contrast or
removing or adding a variable.
• One could of course then just call the regression function again and make the changes there, but one could also
use the function update. This function applies to the old object the change which we defined in the update
function.
• Using for example +/- we could add or remove independent variables to / from the model.
• Assume lm.out contains the independent variables x1 and x2.
> ## add x3
> lm.out.add <- update(lm.out, . ~ . + x3)
> # eliminate x2
> lm.out.minus <- update(lm.out, . ~ . - x2)
103
anova for one object
• In the case that we have only one lm object, the function anova returns an ANOVA table.
• This is however a sequential analysis of variance table for that fit.
– That is, the function returns a table which shows the reductions in the residual sum of squares as each
term of the formula is added in turn to the model, plus the residual sum of squares.
– The significance of this change is evaluated with an F-test.
– We start reading this table at the top.
• This means that the table says nothing about whether a variable belongs to the model; it only makes a statement
about whether the variable improved the fit when it was added to the model.
• The order in which the model is specified matters here.
• For instance, for the model formula y ~ x + z + w the ANOVA table would look different than if you had used
y ~ w + z + x.
• For the first model, the last row of the ANOVA table evaluates whether a model with x, z and w is equal to a
model with only x and z. The row above then compares the model with x and z against the model with only x.
• We call models nested when there is a “largest” model and all other models can be seen as subsets of this
“largest” model.
• If we now submit several nested lm objects to the anova function, the ANOVA table compares the different
models.
• R however cannot make sure that the models are nested; it just makes this assumption. It is a kind of convention
to start the list with the largest model and arrange the others in descending order.
• Then again we can start our comparison in the last row and compare the results sequentially.
na.action
• Model comparisons based on likelihood tests make the assumption that the design matrix is always the “same”.
• This must be taken into account when the data has missing values.
• Normally, when there are missing values, we delete observations which have missing values in the independent
variables that are used in the current model.
• Therefore often smaller models have more observations than larger models.
• In R we can choose in lm between at least two different na.actions:
– na.omit uses all observations that are possible (no missing values for residuals and fitted values and so
on)
– na.exclude also makes residuals and fitted values comparable when missing values are at hand.
plot
As mentioned earlier, most of the model assumptions of regressions can be evaluated using plots.
R provides by default four plots for diagnostics when an lm object is submitted to the plot function. Those plots
are:
It is often easier to evaluate the fit when plotting all four plots into one window using the par() function; see the
sketch below.
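For example (lm.out is a placeholder for a fitted lm object):
> par(mfrow = c(2, 2))   # 2 x 2 layout for the four diagnostic plots
> plot(lm.out)
> par(mfrow = c(1, 1))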
Other plots can be obtained using the which argument. For details see ?plot.lm.
104
model.matrix
• If one is interested how the design matrix looks one can use the function model.matrix.
• This function returns for an lm object the design matrix where one for example can see which contrast was
used for a factor and so on.
• Especially when there are factors in your model it might be a good idea to check this matrix so that you know
how to interpret the result.
Contrasts in R
As mentioned earlier, factors need dummy variables when they enter a regression model. Depending on that coding,
the interpretation of the parameter estimates changes. Which types of contrasts R uses by default can be found out
using the command:
> getOption("contrasts")
unordered ordered
"contr.treatment" "contr.poly"
There one can see what R uses as default contrasts for unordered factors and ordered factors.
The contrasts discussed earlier have in R the following names:
To specify the characteristics of each contrast like which is the default comparison level in the treatment contrast see
the help for the contrast of interest.
Recall here also the function relevel.
If one wants different contrasts than the default ones, there are two ways to change this. First, we can change it
globally so that it affects all applications where contrasts are needed. For this we use the options command and
specify there the default contrasts for unordered and ordered factors. E.g.:
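For instance, to use sum contrasts for unordered factors globally (a sketch):
> options(contrasts = c("contr.sum", "contr.poly"))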
Or we change it only in our regression function call. Here we can even use several different contrasts. If we call for
example the regression function lm and we have two factors, named factor1 (with treatment contrast) and factor2
(Helmert contrast), we could use:
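A sketch of such a call (the response y and the data set dat are placeholders):
> lm(y ~ factor1 + factor2, data = dat,
+    contrasts = list(factor1 = "contr.treatment",
+                     factor2 = "contr.helmert"))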
Fitted values
• There is a generic function to extract fitted values from a regression object. That function is called fitted.
• However especially for lm objects there are also two other ways to extract fitted values. Let us call our lm
object again lm.out. Then we can get the fitted values using:
– fitted(lm.out)
– fitted.values(lm.out)
– lm.out$fitted
105
Residuals in R
> lm.out$res
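Residuals can also be extracted with the generic function residuals; the standardized and studentized residuals discussed earlier are available via rstandard and rstudent (lm.out is again a placeholder for a fitted lm object):
> residuals(lm.out)
> rstandard(lm.out)   # standardized residuals
> rstudent(lm.out)    # studentized residuals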
Predictions in R I
There can be several reasons to fit a regression model. One reason is to predict the dependent variable for new
subjects or to predict the development in the future.
It is quite easy to get predictions in R; mainly two steps are needed. First one has to create a data frame
(data.new) that contains the settings of the independent variables for which a prediction is wanted. Then one uses
the function predict to obtain the predictions.
Assume one wants to predict for the lm.out object and one has a data frame data.new for which one wants to predict.
Then use predict(lm.out, newdata = data.new), as shown in the sketch below.
When we are also interested in intervals we can add the interval argument: we can request either the real prediction
interval (which also takes the variation of the errors into account, interval = "prediction") or the confidence
interval, i.e., the interval for the expected value of the response (interval = "confidence").
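In code (a sketch using the placeholders lm.out and data.new):
> predict(lm.out, newdata = data.new)
> predict(lm.out, newdata = data.new, interval = "prediction")
> predict(lm.out, newdata = data.new, interval = "confidence")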
Influence diagnostics in R
• For a regression object lm.out, the function influence.measures(lm.out) will return a data frame containing
all important influence measures such as:
– DFBETAS: measures the difference in each parameter estimate with and without the influential point.
– DFFITS: scaled difference between the ith fitted value ŷi obtained from the full data and the ith predicted
value ŷ(i) obtained by deleting the ith observation.
– Cook’s distance: $D_i = \frac{r_i^2\, h_i}{(p+1)\,\sigma^2\,(1-h_i)^2}$
– covariance ratios: $\det\big(\hat\sigma^2_{(i)} (X_{(i)}^\top X_{(i)})^{-1}\big) \big/ \det\big(\hat\sigma^2 (X^\top X)^{-1}\big)$
– leverage values for each observation (column hat).
• Observations assumed to be influential concerning any of the diagnostics are marked with an asterisk.
106
Model selection in R
• Automatic model selection is also possible in R, however not based on p-values but on AIC or BIC. The
function for this is step; see the sketch below.
• It can perform all three different types of selection: backward, forward and stepwise.
• One can even specify minimal and maximal models between which we want to choose. In general one can
penalize the number of parameters with any weight k, but only the settings k = 2 (AIC) or k = log(n)
(BIC) have a theoretical foundation.
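A sketch of typical calls (lm.out is a placeholder for a fitted model):
> step(lm.out)                          # selection based on AIC (k = 2)
> step(lm.out, k = log(nobs(lm.out)))   # selection based on BIC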
Examples
Cherry Tree Example I
As a first example consider the trees data set (shipped with R in the datasets package). The data set contains the
girth, height and volume of 31 felled black cherry trees. The aim is to obtain a model which can be used to predict
the volume of a tree based on its height and girth.
> str(trees)
'data.frame': 31 obs. of 3 variables:
$ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
$ Height: num 70 65 63 72 81 83 66 75 80 75 ...
$ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
> summary(trees)
Girth Height Volume
Min. : 8.3 Min. :63 Min. :10.2
1st Qu.:11.1 1st Qu.:72 1st Qu.:19.4
Median :12.9 Median :76 Median :24.2
Mean :13.2 Mean :76 Mean :30.2
3rd Qu.:15.2 3rd Qu.:80 3rd Qu.:37.3
Max. :20.6 Max. :87 Max. :77.0
> plot(trees)
107
[Plot: scatterplot matrix (pairs plot) of Girth, Height and Volume]
Let us first fit a marginal model for each of the two explanatory variables.
> options(show.signif.stars=FALSE)
> fit.girth <- lm(Volume ~ Girth, data = trees)
> summary(fit.girth)
Call:
lm(formula = Volume ~ Girth, data = trees)
Residuals:
Min 1Q Median 3Q Max
-8.065 -3.107 0.152 3.495 9.587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.943 3.365 -11.0 7.6e-12
Girth 5.066 0.247 20.5 < 2e-16
Call:
108
lm(formula = Volume ~ Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-21.27 -9.89 -2.89 12.07 29.85
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.124 29.273 -2.98 0.00583
Height 1.543 0.384 4.02 0.00038
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.406 -2.649 -0.288 2.200 8.485
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.988 8.638 -6.71 2.7e-07
Girth 4.708 0.264 17.82 < 2e-16
Height 0.339 0.130 2.61 0.014
> coef(fit.both)
(Intercept) Girth Height
-57.9877 4.7082 0.3393
> confint(fit.both)
2.5 % 97.5 %
(Intercept) -75.68226 -40.2931
Girth 4.16684 5.2495
Height 0.07265 0.6059
109
> fit.full <- lm(Volume ~ Girth + I(Girth^2) + Height + I(Height^2),
+ data = trees)
> summary(fit.full)
Call:
lm(formula = Volume ~ Girth + I(Girthˆ2) + Height + I(Heightˆ2),
data = trees)
Residuals:
Min 1Q Median 3Q Max
-4.368 -1.670 -0.158 1.792 4.358
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.95510 63.01363 -0.02 0.988
Girth -2.79657 1.46868 -1.90 0.068
I(Girthˆ2) 0.26545 0.05169 5.14 2.4e-05
Height 0.11937 1.78459 0.07 0.947
I(Heightˆ2) 0.00172 0.01190 0.14 0.886
> coef(fit.full)
(Intercept) Girth I(Girthˆ2) Height I(Heightˆ2)
-0.955101 -2.796569 0.265446 0.119372 0.001717
> confint(fit.full)
2.5 % 97.5 %
(Intercept) -130.48147 128.57127
Girth -5.81548 0.22234
I(Girthˆ2) 0.15920 0.37169
Height -3.54890 3.78765
I(Heightˆ2) -0.02275 0.02619
> anova(fit.full)
Analysis of Variance Table
Response: Volume
Df Sum Sq Mean Sq F value Pr(>F)
Girth 1 7582 7582 1060.60 < 2e-16
I(Girthˆ2) 1 213 213 29.78 1e-05
Height 1 125 125 17.54 0.00029
I(Heightˆ2) 1 0 0 0.02 0.88645
Residuals 26 186 7
110
> with(trees, cor(Girth, Girth^2))
[1] 0.993
> with(trees, cor(Height, Height^2))
[1] 0.9989
Call:
lm(formula = Volume ~ I(Girth - m.Girth) + I((Girth - m.Girth)ˆ2) +
I(Height - m.Height) + I((Height - m.Height)ˆ2), data = trees)
Residuals:
Min 1Q Median 3Q Max
-4.368 -1.670 -0.158 1.792 4.358
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.57375 0.70403 39.17 < 2e-16
I(Girth - m.Girth) 4.23689 0.20222 20.95 < 2e-16
I((Girth - m.Girth)ˆ2) 0.26545 0.05169 5.14 2.4e-05
I(Height - m.Height) 0.38031 0.09390 4.05 0.00041
I((Height - m.Height)ˆ2) 0.00172 0.01190 0.14 0.88645
Let’s eliminate the squared term for Height as it’s not significant:
Call:
lm(formula = Volume ~ I(Girth - m.Girth) + I((Girth - m.Girth)ˆ2) +
I(Height - m.Height), data = trees)
Residuals:
111
Min 1Q Median 3Q Max
-4.293 -1.669 -0.102 1.785 4.349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.6109 0.6431 42.93 < 2e-16
I(Girth - m.Girth) 4.2325 0.1963 21.56 < 2e-16
I((Girth - m.Girth)ˆ2) 0.2686 0.0459 5.85 3.1e-06
I(Height - m.Height) 0.3764 0.0882 4.27 0.00022
112
Cherry Tree Example XIV
[Plot: the four default lm diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage); observations 17, 18, 26 and 30 are flagged]
Call:
lm(formula = log(Volume) ~ log(Girth) + log(Height), data = trees)
Residuals:
Min 1Q Median 3Q Max
-0.16856 -0.04849 0.00243 0.06364 0.12922
Coefficients:
113
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.632 0.800 -8.29 5.1e-09
log(Girth) 1.983 0.075 26.43 < 2e-16
log(Height) 1.117 0.204 5.46 7.8e-06
[Plot: diagnostic plots for the log-log model; observations 11, 15, 16, 17 and 18 are flagged]
Anorexia Example
Next we will use the anorexia data which is also in the MASS package.
The data set has three variables:
• Treat
Type of psychotherapy. Factor of three levels Cont, CBT and FT. Cont should be the reference group.
• Prewt
Weight of the subject before the treatment in lbs.
• Postwt
Weight of the subject after the treatment in lbs.
Of interest is now, if the treatments have different effects on the weight of the subjects.
114
Anorexia Example I
This data set contains the effect of different forms of therapy on the body weight of subjects suffering from anorexia.
Anorexia Example II
> summary(anorexia)
Treat Prewt Postwt
CBT :29 Min. :70.0 Min. : 71.3
Cont:26 1st Qu.:79.6 1st Qu.: 79.3
FT :17 Median :82.3 Median : 84.0
Mean :82.4 Mean : 85.2
3rd Qu.:86.0 3rd Qu.: 91.5
Max. :94.9 Max. :103.6
115
Anorexia Example III
[Plot: boxplot of pre-treatment weight (preweight) by TREAT (Cont, CBT, FT)]
Anorexia Example IV
[Plot: boxplot of post-treatment weight (postweight) by TREAT (Cont, CBT, FT)]
Anorexia Example V
This shows how pipes (Chapter 3) can be used for summarizing data frames:
116
> anorexia |>
+ subset(select = Prewt:TREAT) |>
+ with(aggregate(cbind(Prewt, Postwt),
+ data.frame(TREAT),
+ function(x) c(mean=mean(x), sd = sd(x)))) |>
+ cbind(n.group = with(anorexia, tapply(Prewt, TREAT, length)))
TREAT Prewt.mean Prewt.sd Postwt.mean Postwt.sd n.group
Cont Cont 81.558 5.707 81.108 4.744 26
CBT CBT 82.690 4.845 85.697 8.352 29
FT FT 83.229 5.017 90.494 8.475 17
Anorexia Example VI
We fit a linear model with TREAT as the explanatory variable. Note that the treatment contrasts are used by default.
> summary(anfit1)
Call:
lm(formula = Postwt ~ TREAT, data = anorexia)
Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 81.11 1.43 56.75 <2e-16
TREATCBT 4.59 1.97 2.33 0.0227
TREATFT 9.39 2.27 4.13 0.0001
The intercept coefficient gives the average weight (post treatment) for the Cont control (i.e., reference) group; the
TREATCBT coef shows that patients in the CBT group have on average 4.589 lbs more than the reference group; the
TREATFT coef shows that patients in the FT group have on average 9.386 lbs more than the reference group.
117
> anfit1b <- lm(Postwt ~ TREAT - 1, data = anorexia)
> model.matrix(anfit1b)[id,]
TREATCont TREATCBT TREATFT
12 1 0 0
24 1 0 0
36 0 1 0
48 0 1 0
60 0 0 1
72 0 0 1
Anorexia Example IX
The coefficients now represent the average weight post treatment in each category.
> summary(anfit1b)
Call:
lm(formula = Postwt ~ TREAT - 1, data = anorexia)
Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
TREATCont 81.11 1.43 56.8 <2e-16
TREATCBT 85.70 1.35 63.3 <2e-16
TREATFT 90.49 1.77 51.2 <2e-16
Note: 1. The hypothesis tests are not so informative as one is typically interested in whether the differences among
the groups are significant.
Anorexia Example X
Anorexia Example XI
118
> summary(anfit1c)
Call:
lm(formula = Postwt ~ TREAT, data = anorexia, contrasts = list(TREAT = "contr.sum"))
Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.7661 0.8819 97.26 < 2e-16
TREAT1 -4.6584 1.2078 -3.86 0.00025
TREAT2 -0.0696 1.1782 -0.06 0.95309
• (Intercept) (β0 ) - mean of the mean weights in each group (a bit weird . . . if the data were balanced, it
would be the mean weight in the whole dataset).
• TREAT1 (β1 ) - deviation of the average weight for Cont from the intercept.
• TREAT2 (β2 ) - deviation of the average weight for CBT from the intercept.
• We don’t have any coefficient for FT as its deviation is by construction −β1 − β2 .
• Note: The F-statistic and R2 do not change, we only transform the coefficients.
> summary(anfit1d)
Call:
lm(formula = Postwt ~ TREAT, data = anorexia, contrasts = list(TREAT = "contr.helmert"))
Residuals:
Min 1Q Median 3Q Max
-15.294 -3.730 -0.002 4.781 17.903
Coefficients:
Estimate Std. Error t value Pr(>|t|)
119
(Intercept) 85.766 0.882 97.26 < 2e-16
TREAT1 2.294 0.984 2.33 0.02267
TREAT2 2.364 0.674 3.51 0.00081
• (Intercept) (β0 ) - mean of the mean weight in each group (still weird . . . ).
• TREAT1 (β1 ) - the average value of the means in Cont and CBT is 2.29 lbs higher than the mean of Cont.
• TREAT2 (β2 ) - the average value of the means in Cont, CBT and FT is 2.364 lbs higher than the average value
of the means in Cont and CBT.
• Note: Not the most intuitive . . .
Let’s have a look now at a scatterplot of Prewt and Postwt and color the points by TREAT
120
Anorexia Example XV
[Plot: scatterplots of Postwt vs. Prewt - all groups together and separately for Cont, CBT and FT]
The relationship seems different for the different TREAT groups.
The following will fit a regression with the same slope but different intercepts for the different TREAT groups.
We can plot the regression lines for each class by first calculating the main effects (i.e., separate intercepts) from the
coefficients:
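The fitting call itself is not shown on the slide; it presumably has the following form (the object name anfit2 is an assumption):
> anfit2 <- lm(Postwt ~ TREAT + Prewt, data = anorexia)
> coef(anfit2)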
121
Anorexia Example XVII
[Plot: Postwt vs. Prewt with the fitted values (marked x) and parallel regression lines for Cont, CBT and FT]
The following will fit a regression with different slopes and different intercepts for the different TREAT groups.
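The call, reconstructed from the output below:
> anfit3 <- lm(Postwt ~ TREAT * Prewt, data = anorexia)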
Call:
lm(formula = Postwt ~ TREAT * Prewt, data = anorexia)
Residuals:
Min 1Q Median 3Q Max
-12.812 -3.850 -0.915 4.001 15.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.051 18.809 4.89 6.7e-06
TREATCBT -76.474 28.347 -2.70 0.0089
TREATFT -77.232 33.133 -2.33 0.0228
Prewt -0.134 0.230 -0.58 0.5617
TREATCBT:Prewt 0.982 0.344 2.85 0.0058
TREATFT:Prewt 1.043 0.400 2.61 0.0112
122
> model.matrix(anfit3)[id,]
(Intercept) TREATCBT TREATFT Prewt TREATCBT:Prewt TREATFT:Prewt
12 1 0 0 88.7 0.0 0.0
24 1 0 0 77.5 0.0 0.0
36 1 1 0 80.5 80.5 0.0
48 1 1 0 76.5 76.5 0.0
60 1 0 1 86.7 0.0 86.7
72 1 0 1 87.3 0.0 87.3
Anorexia Example XX
> summary(anfit3)
Call:
lm(formula = Postwt ~ TREAT * Prewt, data = anorexia)
Residuals:
Min 1Q Median 3Q Max
-12.812 -3.850 -0.915 4.001 15.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.051 18.809 4.89 6.7e-06
TREATCBT -76.474 28.347 -2.70 0.0089
TREATFT -77.232 33.133 -2.33 0.0228
Prewt -0.134 0.230 -0.58 0.5617
TREATCBT:Prewt 0.982 0.344 2.85 0.0058
TREATFT:Prewt 1.043 0.400 2.61 0.0112
Note:
From the coefficients we can compute the intercepts and slopes of the different regression lines:
> coef(anfit3)
(Intercept) TREATCBT TREATFT Prewt TREATCBT:Prewt
92.0515 -76.4742 -77.2317 -0.1342 0.9822
TREATFT:Prewt
1.0434
123
> abline(coef(anfit3)[1],coef(anfit3)[4], col=cols[1])
> abline(coef(anfit3)[1]+coef(anfit3)[2],coef(anfit3)[4]+
+ coef(anfit3)[5], col=cols[2])
> abline(coef(anfit3)[1]+coef(anfit3)[3],coef(anfit3)[4]+
+ coef(anfit3)[6], col=cols[3])
> with(anorexia, points(Prewt,fitted(anfit3),
+ pch = "x",
+ col = cols[as.numeric(TREAT)]))
[Plot: Postwt vs. Prewt with the fitted values (marked x) and separate regression lines (different intercepts and slopes) for Cont, CBT and FT]
To make the intercept more interpretable we could subtract the minimum value from Prewt:
> min(anorexia$Prewt)
[1] 70
> anorexia$Prewt2 <-
+ anorexia$Prewt - min(anorexia$Prewt)
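The corresponding fit, reconstructed from the output below:
> anfit4 <- lm(Postwt ~ TREAT * Prewt2, data = anorexia)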
> summary(anfit4)
Call:
lm(formula = Postwt ~ TREAT * Prewt2, data = anorexia)
124
Residuals:
Min 1Q Median 3Q Max
-12.812 -3.850 -0.915 4.001 15.964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.659 2.954 27.98 <2e-16
TREATCBT -7.723 4.558 -1.69 0.0949
TREATFT -4.193 5.477 -0.77 0.4467
Prewt2 -0.134 0.230 -0.58 0.5617
TREATCBT:Prewt2 0.982 0.344 2.85 0.0058
TREATFT:Prewt2 1.043 0.400 2.61 0.0112
[Plot: the four default diagnostic plots for the interaction model]
How would the weight of a patient with a weight of 90lbs before the study change post-study depending on the
treatment?
125
> new.data <- data.frame(Prewt = c(90, 90, 90),
+ Prewt2 = c(20, 20, 20),
+ TREAT = factor(c("Cont","CBT","FT"),
+ levels = c("Cont","CBT","FT")))
> new.data
Prewt Prewt2 TREAT
1 90 20 Cont
2 90 20 CBT
3 90 20 FT
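A sketch of the prediction step (anfit4 is based on Prewt2, so new.data contains both Prewt and Prewt2):
> predict(anfit4, newdata = new.data)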
As a short final example we consider the Scottish Hills data (data set hills in the MASS package), which gives the
record times in 1984 for 35 Scottish hill races. The variables are:
[Plot: scatterplot matrix of dist, climb and time]
126
Scottish Hills Example II
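The object fit.lm used below was presumably created as (reconstructed from the Call in the output):
> fit.lm <- lm(time ~ dist + climb, data = hills)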
> summary(fit.lm)
Call:
lm(formula = time ~ dist + climb, data = hills)
Residuals:
Min 1Q Median 3Q Max
-16.22 -7.13 -1.19 2.37 65.12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.99204 4.30273 -2.09 0.045
dist 6.21796 0.60115 10.34 9.9e-12
climb 0.01105 0.00205 5.39 6.4e-06
[Plot: diagnostic plots for fit.lm; Bens of Jura, Ben Nevis and Lairig Ghru are among the flagged observations]
Knock Hill has a large residual and has Cook distance close to 0.5.
127
> # removing Knock Hill
> fit.lm.wKH <- update(fit.lm, subset = -18)
> coef(fit.lm.wKH)
(Intercept) dist climb
-13.53035 6.36456 0.01185
The Bens of Jura race was identified as an influential point given that it has a Cook’s distance above one.
[Plot: diagnostic plots for the model without Knock Hill; Lairig Ghru and Ben Nevis are among the flagged observations]
We can also weight observations in the regression model (by default all observations contribute equally to the
estimation of the coefficients).
128
> # weights 1/dist^2 - long distance races get less weight
> fit.lm2 <- lm(time ~ dist + climb, weight = 1 / dist^2, data = hills)
> summary(fit.lm2)
Call:
lm(formula = time ~ dist + climb, data = hills, weights = 1/distˆ2)
Weighted Residuals:
Min 1Q Median 3Q Max
-3.728 -1.521 -0.513 0.324 18.620
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.62715 6.26766 0.58 0.5668
dist 5.93960 1.71496 3.46 0.0015
climb 0.00384 0.00482 0.80 0.4321
> str(influence.measures(fit.lm2))
List of 3
$ infmat: num [1:35, 1:7] -0.22373 0.00126 0.01023 0.01639 0.00281 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:35] "Greenmantle" "Carnethy" "Craig Dunain" "Ben Rha" ...
.. ..$ : chr [1:7] "dfb.1_" "dfb.dist" "dfb.clmb" "dffit" ...
$ is.inf: logi [1:35, 1:7] FALSE FALSE FALSE FALSE FALSE FALSE ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:35] "Greenmantle" "Carnethy" "Craig Dunain" "Ben Rha" ...
.. ..$ : chr [1:7] "dfb.1_" "dfb.dist" "dfb.clmb" "dffit" ...
$ call : language lm(formula = time ~ dist + climb, data = hills, weights = 1/distˆ2)
- attr(*, "class")= chr "infl"
For the weighted regression Knock Hill is influential (it has a low value for distance, so its influence gets higher due to
the weight). Also Cow Hill has a small distance.
> summary(influence.measures(fit.lm2))
Potentially influential observations of
lm(formula = time ~ dist + climb, data = hills, weights = 1/distˆ2) :
129
[Plot: diagnostic plots for the weighted regression fit.lm2; Knock Hill, Bens of Jura, Two Breweries, Black Hill and Creag Dubh are among the flagged observations]
Recap
We have seen so far multiple operations on matrices:
• the dimensions of a matrix can be obtained using the functions dim, ncol and nrow.
130
• subsetting is done using the function [ where, in a linear algebra context, the argument drop = FALSE is often
important (if we need vectors to be row or column vectors)
• applying functions rowwise or columnwise: apply(x, 1, function(x) ...), apply(x, 2, function(x) ...)
• specialized functions for rowwise and columnwise summaries: colMeans, rowMeans, colSums, rowSums (faster
than apply).
• standardizing a matrix using scale.
131
[3,] 0.5736 0.09268 0.7765
> diag(X)
[1] 0.7570 0.7583 0.7765
> diag(3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
[3,] 0 0 1
> diag(1:3)
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 2 0
[3,] 0 0 3
> X
[,1] [,2] [,3]
[1,] 0.7570 0.18870 0.6224
[2,] 0.7758 0.75827 0.4159
[3,] 0.5736 0.09268 0.7765
> t(X)
[,1] [,2] [,3]
[1,] 0.7570 0.7758 0.57359
[2,] 0.1887 0.7583 0.09268
[3,] 0.6224 0.4159 0.77651
> det(X)
[1] 0.122
R has no built-in trace function in base R, but one can easily be defined:
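For example:
> tr <- function(X) sum(diag(X))   # trace = sum of the diagonal elements
> tr(diag(1:3))
[1] 6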
Triangular matrices I
• Functions lower.tri() and upper.tri() can be used to obtain the lower and upper parts of matrices.
• The output of these functions is a matrix of logical values where TRUE marks the relevant triangular
elements.
132
> lower.tri(X)
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] TRUE FALSE FALSE
[3,] TRUE TRUE FALSE
Triangular matrices II
We can use these functions, e.g., to replace all upper triangular elements by zero:
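> X[upper.tri(X)] <- 0   # set all elements above the diagonal to zero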
Matrix arithmetic
• Multiplication of a matrix X by a scalar a is the same as the multiplication of a vector with a scalar.
• Elementwise addition, multiplication etc can be done with +, * etc. (the dimensions must match)
> Y <- 2 * M
> Y + M
[,1] [,2] [,3]
[1,] 3 15 27
[2,] 6 18 30
[3,] 9 21 33
[4,] 12 24 36
Matrix multiplication
• For standard matrix multiplication the function is %*%. Let x and y be vectors and X and Y matrices.
• If vectors are used in the multiplication, R tries to figure out whether they should be row or column vectors.
• If y %*% x is computed and both vectors have the same length, the inner product is returned as a (1 × 1) matrix.
133
Matrix multiplication II
> X
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> Y
[,1] [,2] [,3]
[1,] 10 14 18
[2,] 11 15 19
[3,] 12 16 20
[4,] 13 17 21
> x %*% y
[,1]
[1,] 38
> x %*% X
[,1] [,2] [,3]
[1,] 14 32 50
> X %*% x
[,1]
[1,] 30
[2,] 36
[3,] 42
> Y %*% X
[,1] [,2] [,3]
[1,] 92 218 344
[2,] 98 233 368
[3,] 104 248 392
[4,] 110 263 416
Matrix inversion
To obtain the inverse of an invertible square matrix, R has the function solve.
Computing the inverse is, however, computationally expensive, and there is hardly ever a good reason to explicitly
invert a matrix in statistical computations.
More on solve I
The function solve actually returns the inverse of a matrix only as a byproduct.
In general the purpose of the function is to solve systems of linear equations like
$$Ax = b \iff x = A^{-1}b$$
More on solve II
Assume $A$ is an $(n \times n)$ matrix. $A^{-1}$ is the solution to the matrix equation $AA^{-1} = I_n$. This can be seen as $n$
separate systems of linear equations in $n$ unknowns, whose solutions are the columns of the inverse.
It would be inefficient to first solve $n$ systems of linear equations in order to obtain the inverse, only to then solve
one system, namely the original one.
Moreover, computing the inverse requires many more calculations, which gives rounding errors more opportunities to
distort the results.
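A small sketch of the two routes (A and b are made-up inputs); solving directly is both cheaper and numerically preferable:

set.seed(1)
A <- crossprod(matrix(rnorm(9), 3, 3))   # an invertible 3 x 3 matrix
b <- c(1, 2, 3)
solve(A, b)        # solve Ax = b directly
solve(A) %*% b     # same result via the explicit inverse (not recommended)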
For a symmetric matrix the eigendecomposition (spectral decomposition) is $A = UDU^\top$, where $D$ is a diagonal matrix of eigenvalues and the columns of $U$ are the corresponding eigenvectors; in R it is computed with eigen().
> X <- matrix(rnorm(300), ncol = 3)
> covX <- cov(X)
> eigen(covX, symmetric = TRUE)
eigen() decomposition
$values
[1] 1.3165 1.1031 0.9205
$vectors
[,1] [,2] [,3]
[1,] 0.6517 -0.18573 0.73540
[2,] 0.1693 0.98072 0.09768
[3,] -0.7394 0.06084 0.67055
• qr: QR-decomposition
• chol: Cholesky decomposition of a symmetric positive definite matrix
• svd: singular value decomposition
• outer : performs an operation on all possible pairs of elements of two vectors.
• kronecker: computes the Kronecker product.
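Two quick illustrations of the last two functions:

outer(1:3, 1:4)                       # 3 x 4 matrix of all products i * j
outer(1:3, 1:4, FUN = "+")            # the operation can be changed via FUN
kronecker(diag(2), matrix(1, 2, 2))   # block-diagonal 4 x 4 matrix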
When working with matrices that have special properties, such as sparse matrices, it is worth checking the
Matrix package, which provides classes for the different types of matrices and can take advantage of that knowledge,
for example when computing decompositions or products.
Cholesky decomposition in R I
• If $A$ is positive semidefinite, it possesses a square root $B$ such that $B^2 = A$.
• The Cholesky decomposition is similar, but the idea is to find an upper triangular matrix $U$ such that $U^\top U = A$.
Cholesky decomposition in R II
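The code of this slide is not fully reproduced here; judging from the output below, the example was the Cholesky factor of the 3 × 3 Hilbert matrix, roughly along these lines:

H3 <- 1 / (outer(1:3, 1:3, "+") - 1)   # 3 x 3 Hilbert matrix: H[i, j] = 1/(i + j - 1)
chol.H3 <- chol(H3)                    # upper triangular Cholesky factor U
chol.H3
crossprod(chol.H3)                     # t(U) %*% U recovers H3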
Cholesky decomposition in R III
• The Cholesky decomposition can be employed to find the inverse of $A$ more efficiently:
$$A = U^\top U \;\Rightarrow\; A^{-1} = U^{-1}(U^{-1})^\top,$$
where computing $U^{-1}$ is easier given the triangular structure.
• A−1 can be obtained by chol2inv():
> chol2inv(chol.H3)
[,1] [,2] [,3]
[1,] 9 -36 30
[2,] -36 192 -180
[3,] 30 -180 180
Cholesky decomposition in R IV
• The Cholesky decomposition can be employed to solve linear systems of the form:
$$Ax = b \;\Rightarrow\; U^\top U x = b \;\Rightarrow\; Ux = (U^{-1})^\top b.$$
• The first step, solving $U^\top z = b$ for $z = Ux$, is a lower triangular system, so forward substitution can be used; the function forwardsolve() does this.
• The second step, solving $Ux = z$, is an upper triangular system, so back substitution can be used via the function backsolve().
Cholesky decomposition in R V
For the problem $H_3 x = b$ where $b = (1, 2, 3)^\top$ we have:
> solve(H3, b)
[1] 27 -192 210
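The same solution can be obtained via the Cholesky factor using forward and back substitution (a sketch reusing chol.H3 from above):

b <- c(1, 2, 3)
z <- forwardsolve(t(chol.H3), b)   # solve t(U) z = b (lower triangular)
backsolve(chol.H3, z)              # solve U x = z  (upper triangular)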
QR decomposition in R I
• Another way of decomposing a matrix A is through the QR decomposition:
$$A = QR,$$
where $Q$ is an orthogonal matrix ($Q^\top Q = I$) and $R$ is upper triangular.
QR decomposition in R II
For more details on the output of qr() see the help page ?qr
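The call producing the output shown below is not included on the slide; presumably the Hilbert matrix H3 from above was decomposed along these lines:

H3qr <- qr(H3)
H3qr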
$rank
[1] 3
$qraux
[1] 1.857143 1.684241 0.003901
$pivot
[1] 1 2 3
attr(,"class")
[1] "qr"
QR decomposition in R III
> qr.Q(H3qr)
[,1] [,2] [,3]
[1,] -0.8571 0.5016 0.1170
[2,] -0.4286 -0.5685 -0.7022
[3,] -0.2857 -0.6521 0.7022
QR decomposition in R IV
• The QR decomposition can be used to obtain more accurate solutions to linear systems. If we want to solve
(here $A$ is an $(n \times n)$ matrix):
$$Ax = b \;\Rightarrow\; QRx = b \;\Rightarrow\; Rx = Q^\top b$$
• Here $Q^\top b$ is easy to calculate. The system can then be solved using back substitution, as $R$ is an
upper triangular matrix.
• Function qr.solve(A, b) can be used to solve the above system.
• If the system is over-determined, qr.solve(A, b) returns the least-squares solution, i.e., the $x$ which minimizes the
distance between $b$ and $Ax$. (Note: this is useful in a linear regression context, where $A$ is the design matrix,
$b$ the response and $x$ the vector of coefficients; see the sketch below.)
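A minimal sketch of this least-squares use (the data are simulated purely for illustration):

set.seed(1)
A <- cbind(1, rnorm(10))          # design matrix: intercept and one covariate
b <- 2 + 3 * A[, 2] + rnorm(10)   # response
qr.solve(A, b)                    # least-squares coefficients
coef(lm(b ~ A[, 2]))              # the same coefficients via lm()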
Computational approaches to hypothesis testing
• One-sample location test: $H_0: \mu = \mu_0$ vs. $H_1: \mu \neq \mu_0$ under the assumption $f(x - \mu) = f(-(x - \mu))$ (symmetric
density).
• Two-sample location test: $H_0: F_X = F_Y$ vs. $H_1: F_X \neq F_Y$, where the difference is at most in the
locations of the two groups $X$ and $Y$.
• Test of independence: $H_0: F_{X,Y} = F_X F_Y$ vs. $H_1: F_{X,Y} \neq F_X F_Y$.
Cook book
The goal of classical hypothesis testing is to answer the question: Given a sample and an apparent effect, what is the
probability of seeing such an effect by chance?
Example: t-test in R I
• Student’s t-test is used to test in normal populations a hypothesis about the location or to compare the location
of two normal populations.
• In the latter case one must furthermore decide if the two populations have the same variance or not and if the
test is based on paired or independent observations.
Example: t-test in R II
• All these cases are considered in the function t.test.
– For the one sample case only a numeric vector has to be submitted and by default the hypothetical
location is the origin.
– For the two sample case one can submit either two numeric vectors or a formula where the independent
variable is a factor with two levels.
– In the two sample case the default setting assumes that the samples are independent and have different
variances.
– power.t.test: computes the power of the one- or two-sample t-test, or determines parameters needed to
obtain a target power.
– pairwise.t.test: calculates pairwise comparisons between group levels with corrections for multiple
testing.
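The call producing the following output is not shown on the slide; given the printed data name and null value it was presumably along these lines (crabs is from the MASS package):

library(MASS)
one.samp.t <- t.test(crabs$RW, mu = 10)   # one-sample t-test of H0: mu = 10
one.samp.t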
data: crabs$RW
t = 15, df = 199, p-value <2e-16
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
12.38 13.10
sample estimates:
mean of x
12.74
> str(one.samp.t)
List of 10
$ statistic : Named num 15
..- attr(*, "names")= chr "t"
$ parameter : Named num 199
..- attr(*, "names")= chr "df"
$ p.value : num 1.12e-34
$ conf.int : num [1:2] 12.4 13.1
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 12.7
..- attr(*, "names")= chr "mean of x"
$ null.value : Named num 10
..- attr(*, "names")= chr "mean"
$ stderr : num 0.182
$ alternative: chr "two.sided"
$ method : chr "One Sample t-test"
$ data.name : chr "crabs$RW"
- attr(*, "class")= chr "htest"
> one.samp.t$statistic
t
15.05
> one.samp.t$p.value
[1] 1.116e-34
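Similarly, the two-sample output below was presumably produced with the formula interface (RW by sex in the crabs data):

t.test(RW ~ sex, data = crabs)   # Welch two-sample t-test by default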
data: RW by sex
t = 4.3, df = 188, p-value = 3e-05
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
0.8086 2.1854
sample estimates:
mean in group F mean in group M
13.49 11.99
Types of errors
• In classical hypothesis testing, an effect is considered statistically significant if the p-value is below some threshold,
commonly 5% (known as significance level).
• Assume you have a testing problem with null hypothesis H0 vs. alternative H1 which is done at the significance level α.
• Two errors can basically occur during testing:
– Type 1 error: the effect is actually due to chance, but we will wrongly consider it significant (H0 is rejected but
true).
– Type 2 error: the effect is real but the test fails (H0 is not rejected but false).
– If there is no real effect, the null hypothesis is true, so we can compute the distribution of the test statistic by
simulating under the null hypothesis. Call this distribution $\mathrm{CDF}_T$.
– Each time we run an experiment, we get a test statistic $t$ which is drawn from $\mathrm{CDF}_T$. Then we compute a p-value,
which is the probability that a random value from $\mathrm{CDF}_T$ exceeds $t$, i.e., $1 - \mathrm{CDF}_T(t)$.
– The p-value is less than 5% if $\mathrm{CDF}_T(t)$ is greater than 95%, that is, if $t$ exceeds the 95th percentile. And how
often does a value drawn from $\mathrm{CDF}_T$ exceed the 95th percentile? 5% of the time.
– → If the null hypothesis is true, the p-value has a uniform distribution over the interval [0, 1].
> set.seed(1)
> n <- 50
> m <- 5000
> Pvalue <- replicate(m, t.test(rnorm(n))$p.value)
> summary(Pvalue)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.243 0.496 0.496 0.745 1.000
[Figure: histogram (density scale) of the simulated p-values Pvalue; the distribution is approximately uniform on [0, 1].]
Power of a test
• The power of a test is defined as 1 - false negative rate.
• The false negative rate is harder to compute because it depends on the actual effect size, and normally we don’t know
that.
• One option is to compute a rate conditioned on a hypothetical effect size.
Changes under the H0
Given the testing problem and the assumptions made, test statistics can be computed and compared to some theoretically
derived critical value, often based on further assumptions.
The null hypothesis, however, often implies that, if it were true, a “relabeling” of the data would be possible without
changing “anything”.
• One-sample location test: We can change the signs of x (after centering with respect to µ0) and nothing should change.
• Two-sample location test: We can switch observations between the two groups without changing anything.
• Test of independence: We can match the “X” part with the “Y” part from different observations.
Assuming either that the data are normal or that the sample size is large with finite second moments, we can use the
one-sample t-test.
Randomized one sample t-test
If, however, the sample size is small and normality cannot be assumed, it is better to use the randomized one-sample t-test.
In this set-up, because of the symmetry under the null, the signs of $x - \mu_0$ can be “relabeled”. Therefore the randomization
(sign-change) version of this test has the following steps (see the sketch after the list):
1. compute $y = x - \mu_0$.
2. randomly change the signs of $y$.
3. compute $t_i = \bar{y}/(s_y/\sqrt{n})$.
4. compute how often $t_i$ is more extreme than $t$, where one has to remember that we test two-sided!
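A minimal sketch of these steps as an R function (the function name and the number of randomizations B are arbitrary choices):

rand.t.test <- function(x, mu0 = 0, B = 1000) {
  y <- x - mu0
  t.obs <- mean(y) / (sd(y) / sqrt(length(y)))         # observed t statistic
  t.star <- replicate(B, {
    s <- sample(c(-1, 1), length(y), replace = TRUE)   # random sign change
    mean(s * y) / (sd(s * y) / sqrt(length(y)))
  })
  mean(abs(t.star) >= abs(t.obs))                      # two-sided p-value
}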
> set.seed(4321)
> n <- 30
> x1 <- rnorm(n,2,1)
> x2 <- x1 + 0.5
One sample t-test comparison II
1. The data is $t_3 + 2$.
2. The data is $t_3 + 2.5$.
> set.seed(4321)
> n <- 30
> y1 <- rt(n, df=3) + 2
> y2 <- y1 + 0.5
Tests for this problem
There are many tests for this problem and we assume the two-sample t-test is well known.
In the following we will consider the (non-parametric) alternatives. Define
$$S(z_i) = \begin{cases} 0 & \text{for } z_i < 0 \\ 1 & \text{for } z_i \ge 0 \end{cases}$$
as the sign of $z_i$ and
$$R(z_i) = \sum_{j=1}^{n} S(z_i - z_j)$$
as the rank of $z_i$.
Two sample sign test in R
> t.test(x, y)
data: x and y
t = -35, df = 957, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.096 -1.871
sample estimates:
mean of x mean of y
-0.02839 1.95483
Example I: Wilcoxon rank sum test
> wilcox.test(x, y)
data: x and y
W = 44624, p-value <2e-16
alternative hypothesis: true location shift is not equal to 0
> ST2S(x, y)
$K
[1] 294
$Z
[1] -22.57
$p.val
[1] 0
[Figure: boxplots of x2 and y2 with the group means (mean of x2, mean of y2) marked.]
Example II: Two Sample t-test
data: x2 and y2
t = 8.4, df = 1351, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.2132 0.3432
sample estimates:
mean of x mean of y
0.9859 0.7077
data: x2 and y2
W = 261174, p-value = 0.2
alternative hypothesis: true location shift is not equal to 0
$Z
[1] 1.753
$p.val
[1] 0.07965
Efficiency comparisons
Assuming the data follow normal distributions that differ only in location, the asymptotic relative efficiencies (ARE) for
the test comparisons are:
1. t vs. K: 0.64
2. t vs. $W_N$: 0.95
However, when the data have heavy tails, $W_N$ and K are more efficient than t.
³ As the sample size tends to infinity, it is assumed that the alternative hypothesis approaches the null hypothesis to keep the …
Simulation study for efficiency comparisons II
> P0
TT ST WT
0.050 0.067 0.051
> P1
TT ST WT
0.291 0.243 0.270
> power.t.test(30, 2.86/sqrt(60))
n = 30
delta = 0.3692
sd = 1
sig.level = 0.05
power = 0.2899
alternative = two.sided
> P2
TT ST WT
0.581 0.478 0.550
> power.t.test(30, 4.50/sqrt(60))
n = 30
delta = 0.5809
sd = 1
sig.level = 0.05
power = 0.5997
alternative = two.sided
> P3
TT ST WT
0.899 0.782 0.879
> power.t.test(30, 6.60/sqrt(60))
n = 30
delta = 0.8521
sd = 1
sig.level = 0.05
power = 0.9006
alternative = two.sided
Simulation study for efficiency comparisons V
> P1[2]/P1[1]
ST
0.8351
> P2[2]/P2[1]
ST
0.8227
> P3[2]/P3[1]
ST
0.8699
> P1[3]/P1[1]
WT
0.9278
> P2[3]/P2[1]
WT
0.9466
> P3[3]/P3[1]
WT
0.9778
> P0b
TT ST WT
0.043 0.062 0.042
> P1b
TT ST WT
0.317 0.245 0.291
> P2b
TT ST WT
0.630 0.489 0.615
> P3b
TT ST WT
0.917 0.808 0.904
Numerical optimization and root finding in R
Numerical optimization
• In many areas of statistics and mathematics we have to solve problems like: given a function f(), which value of x makes
f(x) as small or as large as possible?
• E.g., in statistical modeling we may want to find a set of parameters for a model which minimizes the expected
prediction error.
• In some cases we might also have constraints on x, e.g., the parameters shall be non-negative.
• Derivatives and linear algebra often lead to solutions for these problems, but by no means always. This is where
numerical optimization comes in.
Root finding
• Root finding and unconstrained optimization are closely related: solving $f(x) = 0$ can be accomplished by minimizing
$\|f(x)\|^2$, and unconstrained optima of $f$ must be critical points, i.e., solve $\nabla f(x) = 0$.
• For linear least-squares problems this can be solved “exactly” using techniques from linear algebra.
• Other problems can typically only be solved as limits of iterations $x_k = g(x_{k-1})$.
• If $f$ is smooth with Jacobian $J_f(x) = [\partial f_i / \partial x_j(x)]$, the idea is based on the Taylor approximation
$f(x_k) \approx f(x_{k-1}) + J_f(x_{k-1})(x_k - x_{k-1})$.
• If started close enough to the root, the following (Newton) iteration will converge to a root:
$$x_k = x_{k-1} - J_f^{-1}(x_{k-1}) f(x_{k-1}) = g(x_{k-1}), \qquad x_0 = \text{initial guess}$$
Function curve()
The function curve draws a curve corresponding to a function over the interval [from, to].
> f <- function(x) xˆ3 + 15 * x - 4
> curve(f, -5, 5)
> abline(h = 0)
> abline(v = xstar, lty = 2)
[Figure: curve of f(x) = x^3 + 15x − 4 over [−5, 5] with the horizontal line y = 0 and a dashed vertical line at the root xstar.]
The function has 1 real root. The analytical solution is $2 - \sqrt{3}$.
[Figure: curve of the second example function over the same interval, crossing zero three times.]
The function has 3 real roots. The analytical solution is $-2 - \sqrt{3}$, $-2 + \sqrt{3}$ and $4$.
One dimensional example II
• Quasi-Newton methods, which replace $J_f(x_{k-1})$ with another matrix $B_k$ that is less costly to compute or to invert, can be
employed for root finding. The most famous such method is Broyden's method.
– It considers approximations $B_k$ which exactly satisfy the secant equation $f(x_k) = f(x_{k-1}) + B_k(x_k - x_{k-1})$.
– The problem of choosing $B_k$ ends up being a convex quadratic optimization problem with linear constraints.
Tools in R
• These methods are only available in R extension packages.
• Package nleqslv has function nleqslv() which provides the Newton and Broyden methods.
• Package BB has function BBsolve() for Barzilai-Borwein solvers.
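The system solved in the output on the following slides is not reproduced here; as a self-contained illustration, a small two-equation test problem can be solved with both methods like this:

library(nleqslv)
# f1(x) = x1^2 + x2^2 - 2,  f2(x) = exp(x1 - 1) + x2^3 - 2
fsys <- function(x) c(x[1]^2 + x[2]^2 - 2, exp(x[1] - 1) + x[2]^3 - 2)
nleqslv(c(2, 0.5), fsys, method = "Newton")    # Newton's method
nleqslv(c(2, 0.5), fsys, method = "Broyden")   # Broyden's method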
$fvec
[1] 1.500e-09 2.056e-09
$termcd
[1] 1
$message
[1] "Function criterion near zero"
$scalex
[1] 1 1
$nfcnt
[1] 12
$njcnt
[1] 1
$iter
[1] 10
$fvec
[1] 6.842e-10 1.764e-09
$termcd
[1] 1
$message
[1] "Function criterion near zero"
$scalex
[1] 1 1
$nfcnt
[1] 6
$njcnt
[1] 5
$iter
[1] 5
Both methods deliver the correct solution; Newton's method needs fewer iterations.
$fvec
[1] 6.839e-10 1.762e-09
$termcd
[1] 1
$message
[1] "Function criterion near zero"
$scalex
[1] 1 1
$nfcnt
[1] 6
$njcnt
[1] 5
$iter
[1] 5
Optimization
In this section we will cover some algorithms and types of optimization problems and show how they can be implemented in R.
• Newton-Raphson
• Linear programming
• Quadratic programming
Newton-Raphson I
• If the function to be minimized has two continuous derivatives and we know how to evaluate them, we can employ the
Newton-Raphson algorithm.
• If we have a guess $x_0$ at a minimizer, we use a local quadratic approximation for $f$ (equivalently, a linear approximation
for $\nabla f$):
$$x_k = x_{k-1} - H_f^{-1}(x_{k-1}) \nabla f(x_{k-1}),$$
where $H_f(x) = [\partial^2 f / \partial x_i \partial x_j(x)]$ is the Hessian matrix of $f$ at $x$.
Newton-Raphson II
• It can be shown that the NR algorithm converges to a local minimum if $x_0$ is close enough to the solution.
• In practice it can be quite tricky:
– If the second derivative at $x_{k-1}$ is 0, then there is no solution to the Taylor series approximation.
– If $x_{k-1}$ is too far from the solution, the Taylor approximation can be so inaccurate that $f(x_k)$ is larger than
$f(x_{k-1})$. In this case one can replace $x_k$ by $(x_k + x_{k-1})/2$.
Newton-Raphson example in R
$$f(x) = e^{-x} + x^4$$
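The code for this example is not reproduced here; a minimal sketch of the iteration for this f (the starting value and tolerance are arbitrary choices):

# minimize f(x) = exp(-x) + x^4 via Newton-Raphson applied to f'(x) = 0
f1 <- function(x) -exp(-x) + 4 * x^3    # first derivative
f2 <- function(x)  exp(-x) + 12 * x^2   # second derivative
x <- 1                                  # initial guess
for (i in 1:25) {
  x.new <- x - f1(x) / f2(x)            # Newton-Raphson update
  if (abs(x.new - x) < 1e-8) break
  x <- x.new
}
x                                       # approximate minimizer (about 0.528)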
Built-in functions
• In R there are several general purpose optimizers.
• For one-dimensional optimization optimize() can be used.
• Multidimensional optimizers:
– optim(), which implements quasi-Newton variants of Newton-Raphson (BFGS), Nelder-Mead's simplex method and others.
– nlminb()
– nlm()
• If linear inequalities should be used on the parameters constrOptim() can be used.
Example: optimize()
$$f(x) = |x - 3.5| + |x - 2| + |x - 1|$$
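The call itself is not shown on the slide; a minimal version (the search interval is an assumption) would be:

f <- function(x) abs(x - 3.5) + abs(x - 2) + abs(x - 1)
optimize(f, interval = c(0, 5))   # minimum at x = 2 with objective 2.5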
$objective
[1] 2.5
Example: optim()
$$f(a, b) = (a - 1) + 3.2/b + 3\log(\Gamma(a)) + 3a\log(b)$$
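The optim() call is not reproduced on the slide; a sketch (the starting values and the guard against invalid parameters are assumptions) could look like this:

f <- function(theta) {
  a <- theta[1]; b <- theta[2]
  if (a <= 0 || b <= 0) return(Inf)   # keep the search in the valid region
  (a - 1) + 3.2 / b + 3 * lgamma(a) + 3 * a * log(b)
}
optim(c(1, 1), f)   # Nelder-Mead by default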
$value
[1] 3.099
$counts
function gradient
47 NA
$convergence
[1] 0
$message
NULL
Linear programming
When the function to optimize is linear and when the constraints we impose on the values x are linear, the problem is called
linear programming.
$$\min_{x_1,\dots,x_k} \; c_1 x_1 + \dots + c_k x_k$$
subject to:
$$a_{11} x_1 + \dots + a_{1k} x_k \ge b_1$$
$$\vdots$$
$$a_{m1} x_1 + \dots + a_{mk} x_k \ge b_m$$
and $x_1 \ge 0, \dots, x_k \ge 0$.
Linear programming in R
• Function lp() from the lpSolve package can be used to solve linear programming problems.
– argument objective.in - the vector of coefficients of the objective function.
– argument const.mat - a matrix containing the coefficients of the x variables in the left hand side of the constraints;
each row corresponds to a constraint.
– argument const.dir - a character vector containing the direction of the inequality constraints (>=, ==, <=).
– argument const.rhs - a vector containing the constants on the right-hand side of the constraints.
• It is based on the revised simplex method.
Linear programming pollution example I
• A company has developed two procedures for reducing sulfur dioxide and carbon dioxide emissions from its factory.
• The first procedure reduces equal amounts of each gas at a per unit cost of $5.
• The second procedure reduces the same amount of sulfur dioxide as the first method, but reduces twice as much carbon
dioxide gas; the per unit cost of this method is $8.
• The company is required to reduce sulfur dioxide emissions by 2 million units and carbon dioxide emissions by 3 million
units.
• What combination of the two emission procedures will meet this requirement at minimum cost?
• Let $x_1$ and $x_2$ denote the number of units of the first and second procedure used. Since both methods reduce sulfur
dioxide emissions at the same rate, the number of units of sulfur dioxide reduced will be $x_1 + x_2$.
• Noting that there is a requirement to reduce the sulfur dioxide amount by 2 million units, we have the constraint
$x_1 + x_2 \ge 2$.
• The carbon dioxide reduction requirement is 3 million units, and the second method reduces carbon dioxide twice as fast
as the first method, so we have the second constraint x1 + 2x2 ≥ 3.
• Finally, we note that x1 and x2 must be nonnegative, since we cannot use negative amounts of either procedure.
Note: Setting direction = "max" will allow the specification of maximization problems.
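A sketch of the corresponding lp() call for the pollution example (the objective coefficients are the unit costs of $5 and $8):

library(lpSolve)
pollution.lp <- lp(direction = "min",
                   objective.in = c(5, 8),                 # unit costs
                   const.mat = rbind(c(1, 1), c(1, 2)),    # SO2 and CO2 constraints
                   const.dir = c(">=", ">="),
                   const.rhs = c(2, 3))
pollution.lp$solution   # optimal number of units of each procedure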
Multiple optima
It sometimes happens that there are multiple solutions for a linear programming problem. The following problem has solutions
at $(1, 1)$ and $(3, 0)$:
$$\min_{x_1, x_2} \; 4x_1 + 8x_2$$
subject to:
$$x_1 + x_2 \ge 2, \quad x_1 + 2x_2 \ge 3, \quad x_1 \ge 0, \; x_2 \ge 0$$
The lp() function does not alert the user to the existence of multiple minima.
Infeasibility
In such a case the constraints cannot be simultaneously satisfied and the problem has no feasible solution.
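A small hypothetical illustration (the slide's own example is not reproduced here):

library(lpSolve)
# x1 + x2 >= 2 and x1 + x2 <= 1 cannot hold simultaneously
infeasible.lp <- lp(direction = "min", objective.in = c(5, 8),
                    const.mat = rbind(c(1, 1), c(1, 1)),
                    const.dir = c(">=", "<="),
                    const.rhs = c(2, 1))
infeasible.lp$status   # non-zero status (2) indicates no feasible solution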
Unboundedness
In some cases the objective and the constraints give rise to an unbounded solution:
$$\max_{x_1, x_2} \; 5x_1 + 8x_2$$
subject to:
$$x_1 + x_2 \ge 2, \quad x_1 + 2x_2 \ge 3, \quad x_1 \ge 0, \; x_2 \ge 0$$
Quadratic programming I
• Linear programming problems are a special case of optimization problems in which a possibly nonlinear function is
minimized subject to constraints.
• Such problems are typically more difficult to solve and are beyond the scope of this course; an exception is the case where
the objective function is quadratic and the constraints are linear.
• A quadratic programming problem with $m$ constraints is often of the form:
$$\min_x \; \tfrac{1}{2} x^\top D x - d^\top x$$
subject to the constraints $A^\top x \ge b$. Here $x$ is a vector of $p$ unknowns, $D$ is a positive definite $p \times p$ matrix, $d$ is a
vector of length $p$, $A$ is a $p \times m$ matrix, and $b$ is a vector of length $m$.
Quadratic programming II
In R the solve.QP() function of the quadprog package can be used to solve quadratic programs.
• Dmat - a matrix containing the elements of the matrix $D$ of the quadratic form in the objective function
• dvec - a vector containing the coefficients $d$ of the decision variables $x$ in the objective function
• Amat - the matrix $A$ of constraint coefficients; each column of the matrix corresponds to one constraint (the constraints
are read as $A^\top x \ge b$)
• bvec - a vector containing the constants on the right-hand side of the constraints
• meq - a number indicating the number of equality constraints. By default, this is 0. If it is not 0, the equality constraints
should be listed ahead of the inequality constraints.
Quadratic programming example I
• Assume we want to find out how much money to invest in a set of $n$ stocks. Let $x$ denote the vector of weights we want
to invest in the portfolio, so that the portfolio variance is $\sigma^2_{p,n} = x^\top \Sigma x$, where $\Sigma$ is the covariance matrix of the
returns of the stocks. Let $\mu$ denote the vector of average returns of the individual stocks.
• The problem is
$$\min_x \; x^\top \Sigma x - \mu^\top x$$
• We want $\sum_{i=1}^{n} x_i = 1$ (they are weights) and we do not allow short selling, i.e., $x_i \ge 0$.
• In the notation of solve.QP this corresponds to $D = 2\Sigma$ and $d = \mu$.
• The constraints can be specified as:
$$\underbrace{\begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}}_{A^\top}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
\;\begin{matrix} = 1 \\ \ge 0 \\ \ge 0 \\ \ge 0 \end{matrix}$$
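The solve.QP() call itself is not reproduced here; a sketch, assuming Sigma (the 3 × 3 covariance matrix of the returns) and mu (the vector of mean returns) have already been computed:

library(quadprog)
Dmat <- 2 * Sigma
dvec <- mu
Amat <- cbind(rep(1, 3), diag(3))   # columns: sum(x) = 1, then x1, x2, x3 >= 0
bvec <- c(1, 0, 0, 0)
solve.QP(Dmat, dvec, Amat, bvec, meq = 1)   # first constraint is an equality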
$value
[1] -0.002021
$unconstrained.solution
[1] -0.02679 0.16071 0.47321
$iterations
[1] 2 0
$Lagrangian
[1] 0.003667 0.000000 0.000000 0.000000
$iact
[1] 1