Module 3 R Data Science

The document provides an introduction to R, a free and open-source programming language for statistical computing and graphics. It covers basic concepts such as data types, operators, and data structures, as well as how to create and manipulate vectors and data frames. Additionally, it discusses R's graphical capabilities and includes examples of plotting functions.

Uploaded by

siddharth.tcsc

tutorial 1

R
Introduction and descriptive statistics

what is R

• R is a free software programming language and software environment for statistical computing and graphics. (Wikipedia)
• R is open source.

what is R

• R is an object-oriented programming language.
• Everything in R is an object.
• R objects are stored in memory, and are acted upon by functions (and operators).

how to get R

Homepage CRAN: Comprehensive R Archive Network

http://www.r-project.org/

how to edit R

Editor RStudio

http://www.rstudio.com

how R works

using R as a calculator
• Users type expressions to the R interpreter.
• R responds by computing and printing the answers.

arithmetic operators

type: arithmetic (results in numeric value(s))

operator   action performed
+          addition
-          subtraction
*          multiplication
/          division
^          raise to power

logical operators

type: comparison (results in logical value(s): TRUE, FALSE)

operator   action performed
<          less than
>          greater than
==         equal to
!=         not equal to
<=         less than or equal to
>=         greater than or equal to

type: logical connectors

operator   action performed
&          boolean intersection operator (logical and)
|          boolean union operator (logical or)

arithmetic operators

addition / subtraction
> 5 + 5
> 10 - 2

multiplication / division
> 10 * 10
> 25 / 5

power
> 3 ^ 2
> 2 ^ (-2)

Note: 100 ^ (1/2) is equivalent to sqrt(100)
> 100 ^ (1/2)
> sqrt(100)

logical
operators
> 4 < 3
[1] FALSE

> 2^3 == 9
[1] FALSE

> (3 + 1) != 3
[1] TRUE

> (3 >= 1) & (4 == (3+1))


[1] TRUE

14
assignment
• Values are stored by assigning them a name.
• The statements

> z = 17
> z <- 17
> 17 -> z

all store the value 17 under the name z in the workspace.

Assignment operators are: <- , = , ->

data types

There are three basic types or modes of variables:

• numeric (numbers: integers, real)
• logical (TRUE, FALSE)
• character (text strings, in "")

Note: A general missing value indicator is NA.
The type is shown by the mode() function.

data types
> a = 49 # numeric
> sqrt(a)
[1] 7
> mode(a)
[1] "numeric"

> a = "The dog ate my homework" # character
> a
[1] "The dog ate my homework"
> mode(a)
[1] "character"

> a = (1 + 1 == 3) # logical
> a
[1] FALSE
> mode(a)
[1] "logical"

data structures

Elements: numeric, logical, character in

• vectors: ordered sets of elements of one type
• data.frames: ordered sets of vectors (different vector types)
• matrices: ordered sets of vectors (all of one vector type)
• lists: ordered sets of anything.

creating
vectors
The function c( ) can combine several elements into
vectors.
> x = c(1, 3, 5, 7, 8, 9) # numerical vector
> x
[1] 1 3 5 7 8 9

> z = c("I","am","Ironman") # character vector


> z
[1] "I" "am" "Ironman"

> x = c(TRUE,FALSE,NA) # logical vector


> x
[1] TRUE FALSE NA
19
combining
vectors
The function c( ) can be used to combine both
vectors and elements into larger vectors.

> x = c(1, 2, 3, 4)
> c(x, 10)
[1] 1 2 3 4 10

> c(x, x)
[1] 1 2 3 4 1 2 3 4

In fact, R stores elements like 10 as vectors of length one, so that both arguments in the expression above are vectors.

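The point about length-one vectors can be checked directly. The following is a minimal sketch; the variable names are chosen here for illustration only.

```r
# A single number is already a vector of length one.
x <- c(1, 2, 3, 4)
len_single <- length(10)      # length of the "element" 10
is_vec     <- is.vector(10)   # TRUE: 10 is itself a vector

# c() therefore always combines vectors with vectors.
y <- c(x, 10)
print(len_single)
print(is_vec)
print(y)
```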
sequences
A useful way of generating vectors is using the sequence operator. The expression n1:n2 generates the sequence of integers ranging from n1 to n2.
> 1:15
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
[14] 14 15

> 5:-5
[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
> y = 1:11
> y
[1] 1 2 3 4 5 6 7 8 9 10 11
extracting elements
> x = c(1, 3, 5, 7, 8, 9)
> x[3] # extract 3rd position
[1] 5
> x[1:3] # extract positions 1-3
[1] 1 3 5

> x[-2] # without 2nd position
[1] 1 5 7 8 9

> x[x<7] # select values < 7
[1] 1 3 5

> x[x!=5] # select values not equal to 5
[1] 1 3 7 8 9

data frame

Data frames provide a way of grouping a number of related vectors into a single data object.

The function data.frame() takes a number of vectors with same lengths and returns a single object containing all the variables.

df = data.frame(var1, var2, ...)

data frame
In a data frame the column labels are the vector names.

Note: Vectors can be of different types in a data frame (numeric, logical, character).

Data frames can be created in a number of ways:

• Binding together vectors by the function data.frame( ).

data frame
> time = c("early","mid","late","late","early")
> type <- c("G", "G", "O", "O", "G")
> counts <- c(20, 13, 8, 34, 7)
> data <- data.frame(time,type,counts)
> data
   time type counts
1 early    G     20
2   mid    G     13
3  late    O      8
4  late    O     34
5 early    G      7

> fix(data)

example data: low birth weight

name    text                                  variable type
low     low birth weight of the baby          nominal: 0 'no >=2500g' 1 'yes <2500g'
age     age of mother                         continuous: years
lwt     mother's weight at last period        continuous: pounds
race    ethnicity                             nominal: 1 'white' 2 'black' 3 'other'
smoke   smoking status                        nominal: 0 'no' 1 'yes'
ptl     premature labor                       discrete: number of
ht      hypertension                          nominal: 0 'no' 1 'yes'
ui      presence of uterine irritability      nominal: 0 'no' 1 'yes'
ftv     physician visits in first trimester   discrete: number of
bwt     birthweight of the baby               continuous: g

The birthweight data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass during 1986.

example data: low birth weight
loading: library(MASS), the dataframe is called birthwt.

Overview over dataframes:

• dim(birthwt)
• summary(birthwt)
• head(birthwt)
• str(birthwt)

extracting vectors

data$vectorlabel gives the vector named vectorlabel of the dataframe named data.

Extracting elements from this vector is done as usual.

> birthwt$age

> birthwt$age[33]

> birthwt$age[1:10]

some functions in R

name                   function
summary(x)             summary statistics of the elements of x
max(x)                 maximum of the elements of x
min(x)                 minimum of the elements of x
sum(x)                 sum of the elements of x
mean(x)                mean of the elements of x
sd(x)                  standard deviation of the elements of x
median(x)              median of the elements of x
quantile(x, probs=…)   quantiles of the elements of x
sort(x)                ordering the elements of x

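As a small illustration of the functions in the table, the sketch below applies them to a short numeric vector. The vector is the one used in the earlier vector slides; the result variable names are assumptions made for this example.

```r
# Descriptive statistics on a small numeric vector.
x <- c(1, 3, 5, 7, 8, 9)

x_sum    <- sum(x)       # 1 + 3 + 5 + 7 + 8 + 9
x_mean   <- mean(x)      # sum divided by the number of elements
x_median <- median(x)    # average of the two middle values (5 and 7)
x_range  <- c(min(x), max(x))
x_quart  <- quantile(x, probs = c(0.25, 0.75))  # lower and upper quartile
x_desc   <- sort(x, decreasing = TRUE)

print(summary(x))
print(sd(x))
```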
some functions in R

> mean(birthwt$age)
[1] 23.2381

> max(birthwt$age)
[1] 45

> min(birthwt$age)
[1] 14

getting help
To get help on the sd() function you can type either of
> help(sd)
> ?sd

sorting vectors
Sorting / ordering of data in vectors with the function sort()

> help(sort)
> x=sort(birthwt$age, decreasing=FALSE)
> x[1:10]
[1] 14 14 14 15 15 15 16 16 16 16

> x=sort(birthwt$age, decreasing=TRUE)
> x[1:10]
[1] 45 36 36 35 35 34 33 33 33 32

> x[25] # 25th highest age
[1] 30

graphics
R has extensive graphics facilities.
Graphic functions are differentiated in
• high-level graphics functions
• low-level graphics functions

The quality of the graphs produced by R is often cited as a major reason for using it in preference to other statistical software systems.

high-level graphics

name            function
plot(x, y)      bivariate plot of x (on the x-axis) and y (on the y-axis)
hist(x)         histogram of the frequencies of x
barplot(x)      bar plot of the values of x; use horiz=TRUE for horizontal bars
dotchart(x)     if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)
pie(x)          circular pie-chart
boxplot(x)      box-and-whiskers plot
stripchart(x)   plot of the values of x on a line (an alternative to boxplot() for small sample sizes)
mosaicplot(x)   mosaic plot from frequencies in a contingency table
qqnorm(x)       quantiles of x with respect to the values expected under a normal law

high-level
graphics

> hist(birthwt$age)
> boxplot(birthwt$age)

35
hands-on example
loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2.

Load the dataframe into your workspace with the data("PimaIndiansDiabetes2") command.

Get an overview with the functions dim, head.

Calculate the mean and median of the variable insulin. Remove NAs for the calculation with the na.rm = TRUE option in mean and median functions.

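The effect of the na.rm = TRUE option can be seen on a tiny vector. This is a sketch on made-up values, not on the PimaIndiansDiabetes2 data; the vector and variable names are hypothetical.

```r
# Hypothetical measurements with one missing value.
values <- c(10, NA, 30)

mean_default <- mean(values)                 # NA: missing values propagate
mean_clean   <- mean(values, na.rm = TRUE)   # NA removed before averaging
median_clean <- median(values, na.rm = TRUE)

print(mean_default)
print(mean_clean)
print(median_clean)
```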
tutorial 2

R
Graphics and probability theory

plot function
The core R graphics command is plot(). This is an all-in-one function which carries out a number of actions:

• It opens a new graphics window.
• It plots the content of the graph (points, lines etc.).
• It plots x and y axes and boxes around the plot and produces the axis labels and title.

plot function
Parameters in the plot() function are:
• x   x-coordinate(s)
• y   y-coordinates (optional, depends on x)

plot
function
To plot points with x and y coordinates or two
random variables for a data set (one on the x axis,
the other on the y axis; called a scatterplot) , type:

> a = c(1,2,3,4)
> b = c(4,4,0,5)
> plot(x=a,y=b)
> plot(a,b) # the same

42
plot
function
To plot points with x and y coordinates or two
random variables for a data set (one on the x axis,
the other on the y axis; called a scatterplot), type:

> library(MASS)
> plot(x=birthwt$age,y=birthwt$lwt)
# lwt: mothers weight in pounds
> plot(x=birthwt$age[1:10],y=birthwt$lwt[1:10])
# first 10 mothers

43
plot
function
Another example:

> a = seq(-5, +5, by=0.2)
# generates a sequence from -5 to +5 with increment 0.2
[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2
...
[45] 3.8 4.0 4.2 4.4 4.6 4.8 5.0
> b = a^2 # squares all components of a
> plot(a,b)

44
plot function
Parameters in the plot() function are (see help(plot) and help(par)):
• x            x-coordinate(s)
• y            y-coordinates (optional, depends on x)
• main, sub    title and subtitle
• xlab, ylab   axes labels
• xlim, ylim   range of values for x and y
• type         type of plot
• lty          type of lines
• pch          plot symbol
• cex          scale factor

plot symbol / line type

plot type: type=
"p"   points
"l"   lines
"b"   both
"s"   steps
"h"   vertical lines
"n"   nothing

plot symbol: pch=
line type: lty=

plot
function
> a = seq(-5, +5, by=0.2)
> b = a^2
> plot(a, b)
> plot(a,b,main="quadratic function")
> plot(a,b,main="quadratic function",cex=2)
> plot(a,b,main="quadratic function",col="blue")

47
plot
function
> a = seq(-5, +5, by=0.2)
> b = a^2
> plot(a,b,main="quadratic function",type="l")
> plot(a,b,main="quadratic function",type="b")
> plot(a,b,main="quadratic function",pch=2)

[Figure: plot titled "quadratic function", y against x, with x from -4 to 4 and y from 0 to 25]

probability theory, factorials

Binomial coefficients can be computed by choose(n,k):

> choose(8,5)
[1] 56

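The binomial coefficient is defined through factorials, $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. A short sketch, using base R's factorial(), confirms that this agrees with choose() for the example above; the variable names are chosen for illustration.

```r
# Binomial coefficient two ways: built-in and via factorials.
n <- 8
k <- 5

via_choose    <- choose(n, k)
via_factorial <- factorial(n) / (factorial(k) * factorial(n - k))

print(via_choose)
print(via_factorial)
```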
functions for random variables
Distributions can be easily calculated or simulated using R.
The functions are named such that the first letter states what the function calculates or simulates
d = density function (probability function)
p = distribution function
q = quantile (inverse distribution)
r = random number generation
and the last part of the name of the function specifies the type of distribution, e.g.
• binomial distribution

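The naming scheme can be tried out on the normal distribution, whose functions end in "norm". The following sketch assumes the standard normal defaults (mean 0, sd 1); the seed value and variable names are arbitrary choices for this example.

```r
# d / p / q / r prefixes applied to the normal distribution.
d <- dnorm(0)        # density at 0, equals 1 / sqrt(2 * pi)
q <- qnorm(0.975)    # quantile: value below which 97.5% of the mass lies
p <- pnorm(q)        # distribution function: inverts the quantile again

set.seed(1)          # arbitrary seed so the random draws are repeatable
r <- rnorm(5)        # five random realizations

print(c(d = d, q = q, p = p))
print(r)
```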
binomial distribution
Probability function:

$$f(k) = P(X = k) = \binom{n}{k} \pi^k (1 - \pi)^{n-k}$$

• dbinom(x, size, prob)

x: k
size: n
prob: π

normal distribution
Density function:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

• dnorm(x, mean, sd)

normal distribution
Calculating the probability density function:
> dnorm(x=2, mean=6, sd=2)
[1] 0.02699548

[Figure: density f(x) against x, with x from 0 to 12 and f(x) from 0.00 to 0.20]

normal distribution
Distribution function:

$$F(b) = \int_{-\infty}^{b} f(x)\,dx$$

[Figure: density f(x) against x, with b marked on the x-axis]

• pnorm(q, mean, sd)

q: b

normal distribution
Distribution function:

N(10,25) distribution
> pnorm(q=13, mean=10, sd=5)
[1] 0.7257469

[Figure: density f(x) against x, with 13 marked on the x-axis]

binomial distribution
Probability function:

> dbinom(x=5, size=50, prob=0.15)
# Probability of having exactly 5 successes in 50 independent
# observations/measurements with a success probability of 0.15 each
[1] 0.1072481

> dbinom(5, 50, 0.15) # the same

normal
distribution
Plotting the density of a N(5,49)
distribution:

> x_values=seq(-15, 25, by=0.5)


> y_values=dnorm(x_values, mean=5, sd=7)
> plot(x_values,y_values,type="l")

57
hands-on example
loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2.

Make a scatter plot for the variables glucose and insulin.

What are the possible realizations of a random variable X distributed according to Bin(4,0.85)? Calculate all possible values of the probability function of X. Plot the probability function of X with the possible realizations of X on the x axis and the corresponding values of the probability function on the y axis.

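A sketch of the general pattern, deliberately using a different distribution (Bin(3, 0.5), chosen here only for illustration) so that the exercise itself stays open: dbinom() is vectorized over its first argument, so all values of the probability function can be computed in one call and compared with the formula.

```r
# Probability function of a hypothetical Bin(3, 0.5) variable.
n    <- 3
prob <- 0.5
k    <- 0:n                                   # possible realizations

pmf    <- dbinom(k, size = n, prob = prob)    # built-in probability function
manual <- choose(n, k) * prob^k * (1 - prob)^(n - k)   # formula from the slides

print(rbind(k, pmf))
# plot(k, pmf, type = "h")   # uncomment to draw vertical lines per realization
```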
tutorial 3

R
Random numbers and factors

binomial distribution
Generating random realizations:

$$f(k) = P(X = k) = \binom{n}{k} \pi^k (1 - \pi)^{n-k}$$

• rbinom(n, size, prob)

n: number of samples to draw
size: n in the formula above
prob: π
output: number of successes

normal distribution
Generating random realizations:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

• rnorm(n, mean, sd)
n: number of samples to draw

t distribution
Quantiles:

• qt(p, df)
p: quantile probability
df: degrees of freedom

binomial distribution
Generating random realizations:

> rbinom(n=1, size=50, prob=0.15)
# Generating one sample of 50 independent observations/measurements
# with a success probability of 0.15 each
[1] 14 # 14 successes in this simulation

> rbinom(n=1, size=50, prob=0.15)
[1] 7

binomial
distribution
Generating random realizations:

> rbinom(n=10, size=50, prob=0.15)

# Generating 10 samples
[1] 14 10 6 12 8 6 7 10 5 9
# The number of successes for all samples

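Because the draws are random, repeated calls give different numbers, as the two single draws earlier show. If reproducible output is needed, set.seed() can be called first. The sketch below assumes an arbitrary seed value of 42; the variable names are illustrative.

```r
# Reproducible random draws: the same seed gives the same realizations.
set.seed(42)
first <- rbinom(n = 10, size = 50, prob = 0.15)

set.seed(42)
second <- rbinom(n = 10, size = 50, prob = 0.15)

print(first)
print(identical(first, second))   # same seed, same numbers
```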
normal
distribution
> values=rnorm(10, mean=0, sd=1)
> values
[1] -0.56047565 -0.23017749 1.55870831 0.07050839
0.12928774 1.71506499 0.46091621 -1.26506123
-0.68685285 -0.44566197
# 10 simulations from a N(0,1) distribution
> mean(values)
[1] 0.07462565

66
t distribution
Quantiles:

• qt(p, df)
p: quantile probability
df: degrees of freedom
> qt(p=0.95,df=9)
[1] 1.833113
> qt(p=0.95,df=99)
[1] 1.660391
> qnorm(p=0.95,mean=0,sd=1)
[1] 1.644854
> qt(p=0.975,df=99)
[1] 1.984217 # = $t_{1-\alpha/2,\,n-1}$ for α=0.05, n=100

object classes
All objects in R have a class. The class attribute allows R to treat objects differently (e.g. for summary() or plot()).
Possible classes are:
• numeric
• logical
• character
• list
• matrix
• data.frame
• array
• factor

The class is shown by the class() function.

factors

• Categorical variables in R are often specified as factors.
• Factors have a fixed number of categories, called levels.
• summary(factor) displays the frequency of the factor levels.
• Functions in R for creating factors: factor(), as.factor()
• levels() displays and sets levels.

factors

• factor(x,levels,labels)
• as.factor(x)

x: vector of data, usually small number of values
levels: specifies the values (categories) of x
labels: labels the levels

factors

> smoke = c(0,0,1,1,0,0,0,1,0,1)


> smoke
[1] 0 0 1 1 0 0 0 1 0 1

> summary(smoke)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 0.0 0.4 1.0 1.0

> class(smoke)
[1] "numeric"

71
factors

> smoke_new=factor(smoke)

> smoke_new
[1] 0 0 1 1 0 0 0 1 0 1
Levels: 0 1

> summary(smoke_new)
0 1
6 4

> class(smoke_new)
[1] "factor"

72
factors

> smoke_new=factor(smoke,levels=c(0,1))
> smoke_new
[1] 0 0 1 1 0 0 0 1 0 1
Levels: 0 1
> smoke_new=factor(smoke,levels=c(0,1,2))

> smoke_new
[1] 0 0 1 1 0 0 0 1 0 1
Levels: 0 1 2

> summary(smoke_new)
0 1 2
6 4 0
73
factors

> smoke_new=factor(smoke,levels=c(0,1),
  labels=c("no", "yes"))

> smoke_new
[1] no no yes yes no no no yes no yes
Levels: no yes
> summary(smoke_new)
no yes
6 4

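The slides mention that levels() both displays and sets levels. A minimal sketch of the setting direction, reusing the smoke vector from above; the object names smoke_f and counts are chosen here for illustration.

```r
# Renaming factor levels after the factor has been created.
smoke   <- c(0, 0, 1, 1, 0, 0, 0, 1, 0, 1)
smoke_f <- factor(smoke)            # levels are "0" and "1"

levels(smoke_f) <- c("no", "yes")   # set new level names, in level order
counts <- table(smoke_f)            # frequencies per level

print(levels(smoke_f))
print(counts)
```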
factors

> library(MASS)
> summary(birthwt$race)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 1.847 3.000 3.000
> race_new=as.factor(birthwt$race)
> summary(race_new)
1 2 3
96 26 67
> levels(race_new)
[1] "1" "2" "3"
> levels(race_

hands-on example
Sample 20 realizations of a N(0,1) distribution.

Calculate mean and standard deviation.

What is the formula for the confidence interval for the mean for unknown σ?

For a 90% confidence interval and the above sample: What are the parameters α and n? Which value has $t_{1-\alpha/2,\,n-1}$?

Calculate the 90% confidence interval for our example.

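As a sketch of the mechanics only, the block below computes a t-based interval of the form mean ± t · s/√n on a small made-up sample with a 95% level, so that the 90% exercise remains to be done. The data values and variable names are hypothetical; t.test() is used as a cross-check.

```r
# t-based confidence interval for a mean, unknown sigma (hypothetical data).
x     <- c(4.1, 5.3, 4.8, 5.9, 5.0, 4.4, 5.6, 4.9)
n     <- length(x)
alpha <- 0.05                                   # 95% interval in this sketch

t_crit     <- qt(1 - alpha / 2, df = n - 1)     # quantile t_{1-alpha/2, n-1}
half_width <- t_crit * sd(x) / sqrt(n)
ci         <- c(mean(x) - half_width, mean(x) + half_width)

# Cross-check against the interval reported by t.test().
ci_ref <- as.numeric(t.test(x, conf.level = 1 - alpha)$conf.int)

print(ci)
print(ci_ref)
```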
tutorial 4

R
Reading data from files, frequency tables

normal distribution
Quantiles:

• qnorm(p, mean, sd)

p: quantile probability

> qnorm(p=0.95,mean=0,sd=1)
[1] 1.644854 # = $z_{0.95}$
> qnorm(p=0.975,mean=0,sd=1)
[1] 1.959964 # = $z_{1-\alpha/2}$ for α=0.05

reading data: working directory
• For reading or saving files, a simple file name identifies a file in the working directory. Files in other places can be specified by the path name.
• getwd() gives the current working directory.
• setwd("path") sets a specific directory as your working directory.
• Use setwd("path") to load and save data in the directory of your choice.

reading data

• The standard way of storing statistical data is to put them in a rectangular form with rows corresponding to observations and columns corresponding to variables.
• Spreadsheets are often used to store and manipulate data in this way, e.g. EXCEL.
• The function read.table() can be used to read data which has been stored in this way.
• The first argument to read.table() identifies the file to be read.

reading data
Optional arguments to read.table() which can be used to change its behaviour.
• Setting header=TRUE indicates to R that the first row of the data file contains names for each of the columns.
• The argument skip= makes it possible to skip the specified number of lines at the top of the file.
• The argument sep= can be used to specify a character which separates columns. (Use sep=";" for csv files that use semicolons as separators.)
• The argument dec= can be used to specify a character which is used as the decimal point (e.g. dec=",").

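The four arguments can be tried without any external file by writing a small semicolon-separated file to a temporary location first. Everything in this sketch (file contents, column names, the leading comment line) is made up for illustration.

```r
# Write a tiny semicolon-separated file with decimal commas, then read it back.
path <- tempfile(fileext = ".csv")
writeLines(c("exported by a hypothetical tool",   # line to be skipped
             "id;grp;weight",                     # header row
             "1;control;70,5",
             "2;infarct;82,0"),
           path)

dat <- read.table(path,
                  header = TRUE,   # first (non-skipped) row holds column names
                  sep    = ";",    # columns are separated by semicolons
                  dec    = ",",    # decimal comma instead of decimal point
                  skip   = 1)      # ignore the first line of the file

str(dat)
unlink(path)                       # remove the temporary file again
```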
example data: infarct (case/control study)

name          label                  values / unit
nro           identifier
grp           group (string)         'control', 'infarct'
code          coded group            0 'control', 1 'infarct'
sex           sex                    1 'male', 2 'female'
age           age                    years
height        body height            cm
weight        body weight            kg
blood sugar   blood sugar level      mg/100ml
diabet        diabetes               0 'no', 1 'yes'
chol          cholesterol level      mg/100ml
trigl         triglyceride level     mg/100ml
cig           cigarettes             number of

example data: infarct
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv")
Error in scan(file, what,...: line 2 did not have 2
elements # wrong separator

> mi = read.table("infarct data.csv",sep=";")
> summary(mi) # no variable names

> mi = read.table("infarct data.csv",sep=";",
  header=TRUE)
> summary(mi) # with variable names

frequency tables
• table(var1, var2) gives a table of the absolute frequencies of all combinations of var1 and var2. var1 and var2 have to attain a finite number of values (frequency table, cross classification table, contingency table). var1 defines the rows, var2 the columns.
• addmargins(table) adds the sums of rows and columns.
• prop.table(table) gives the relative frequencies, overall or with respect to rows or columns.

frequency
tables
> grp_sex=table(mi$grp,mi$sex)
> grp_sex

1 2
control 25 15
infarct 28 12

> addmargins(grp_sex)

1 2 Sum
control 25 15 40
infarct 28 12 40
Sum 53 27 80

86
frequency tables
> prop.table(grp_sex)

1 2
control 0.3125 0.1875
infarct 0.3500 0.1500
> prop.table(grp_sex,margin=1)

1 2
control 0.625 0.375
infarct 0.700 0.300 # rows sum to 1

> prop.table(grp_sex,margin=2)

1 2
control 0.4716981 0.5555556
infarct 0.5283019 0.4444444 # columns sum to 1

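The same tables can be reproduced without the data file. In this sketch the two vectors are rebuilt by hand from the counts shown above (25/15 and 28/12), which is an assumption made purely so the example is self-contained.

```r
# Rebuild group and sex vectors from the counts shown in the slides.
grp <- rep(c("control", "infarct"), each = 40)
sex <- c(rep(1, 25), rep(2, 15),    # controls: 25 male, 15 female
         rep(1, 28), rep(2, 12))    # infarct:  28 male, 12 female

tab       <- table(grp, sex)              # absolute frequencies
with_sums <- addmargins(tab)              # adds row and column sums
by_row    <- prop.table(tab, margin = 1)  # each row sums to 1
by_col    <- prop.table(tab, margin = 2)  # each column sums to 1

print(with_sums)
print(by_row)
```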
hands-on example
Load the dataset from the file bdendo.csv into the workspace.

Generate a table of the variables d (case-control status) and dur (categorical duration of oestrogen therapy).

Generate a table of the variables d (case-control status) and agegr (age group). Compare the two tables.

tutorial 5

R
Installing packages, the package "pROC"

R packages
• R consists of a base level of functionality together with a set of contributed libraries which provide extended capabilities.
• The key idea is that of a package which provides a related set of software components, documentation and data sets.
• Packages can be installed into R. This may need administrator rights.

pROC – diagnostic testing
Package: pROC
Type: Package
Title: display and analyze ROC curves
Version: 1.7.1
Date: 2014-02-20
Encoding: UTF-8
Depends: R (>= 2.13)
Imports: plyr, utils, methods, Rcpp (>= 0.10.5)
Suggests: microbenchmark, tcltk, MASS, logcondens, doMC, doSNOW
LinkingTo: Rcpp
Author: Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller.
Maintainer: Xavier Robin

installing packages
You can install R packages using the install.packages() command.
> install.packages("pROC")
Installing package(s) into ‘C:/Users/Amke/Documents/R/win-library/2.15’
(as ‘lib’ is unspecified)

downloaded 827 Kb
package ‘pROC’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\Amke\AppData\Local\Temp\RtmpUJPoia\downloaded_packages

Installing R packages using the menu:

using installed packages

When R is running, simply type:

> library(pROC)

This adds the R functions in the library to the search path. You can now use the functions and datasets in the package and inspect the documentation.

cite packages

To cite the package pROC in publications use:

> citation("pROC")
...
Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77
<http://www.biomedcentral.com/1471-2105/12/77/>
...

package pROC
The main function is roc(response, predictor).
It creates the values necessary for an ROC curve.

• response: disease status (as provided by gold standard)
• predictor: continuous test result (to be dichotomized)
• For an roc object the plot(roc_obj) function produces an ROC curve.

package pROC
The function coords(roc_obj,x,best.method,ret) calculates measures of test performance.
• x: value for which measures are calculated (default: threshold), x="best" gives the optimal threshold
• best.method: if x="best", the method to determine the best threshold (e.g. "youden")
• ret: Measures calculated. One or more of "threshold", "specificity", "sensitivity", "accuracy", "tn" (true negative count), "tp" (true positive count), "fn" (false negative count), "fp" (false positive count), "npv" (negative predictive value), "ppv" (positive predictive value)

example data: aSAH (aneurysmal subarachnoid haemorrhage)

name      label                                             values / unit
gos6      Glasgow Outcome Score (GOS) at 6 months           1-5
outcome   prediction of development (to be diagnosed)       'good', 'poor'
gender    sex                                               'male', 'female'
age       age                                               years
wfns      World Federation of Neurological Surgeons Score   1-5
s100b     S100 calcium binding protein B (biomarker,        μg/l
          continuous test result)

package pROC
> data(aSAH) # loads the data set "aSAH"
> head(aSAH)
> rocobj = roc(aSAH$outcome, aSAH$s100b)
> plot(rocobj)

> coords(rocobj, 0.55)
threshold specificity sensitivity
0.5500000 1.0000000 0.2682927
> coords(rocobj, x="best", best.method="youden")
threshold specificity sensitivity
0.2050000 0.8055556 0.6341463
# Youden threshold is 0.205, with the corresponding specificity and sensitivity

Measures of Test Performance

Outcomes of a diagnostic study for a dichotomous test result

             test result
disease      positive         negative
present      true positive    false negative
absent       false positive   true negative

package pROC
> coords(rocobj, x="best", best.method="youden",
  ret=c("threshold","specificity","sensitivity",
  "tn","tp","fn","fp"))
threshold specificity sensitivity
0.2050000 0.8055556 0.6341463
tn tp fn fp
58.0000000 26.0000000 15.0000000 14.0000000

             test result
disease      positive   negative
present      tp: 26     fn: 15
absent       fp: 14     tn: 58

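Sensitivity and specificity follow directly from the four counts reported by coords(). The sketch below recomputes them by hand from tn, tp, fn and fp as printed above, without needing the pROC package; the additional measures use their standard definitions, and the variable names are illustrative.

```r
# Counts taken from the coords() output above.
tp <- 26; fn <- 15   # diseased:     test positive / test negative
fp <- 14; tn <- 58   # not diseased: test positive / test negative

sensitivity <- tp / (tp + fn)   # share of diseased correctly detected
specificity <- tn / (tn + fp)   # share of non-diseased correctly cleared
ppv         <- tp / (tp + fp)   # positive predictive value
npv         <- tn / (tn + fn)   # negative predictive value
accuracy    <- (tp + tn) / (tp + tn + fp + fn)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              ppv = ppv, npv = npv, accuracy = accuracy), 4))
```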
tutorial 6

R
Statistical testing 1

statistical test functions

name              function
t.test( )         Student's t-test
wilcox.test( )    Wilcoxon rank sum test and signed rank test
ks.test( )        Kolmogorov-Smirnov test
chisq.test( )     Pearson's chi-squared test for count data
mcnemar.test( )   McNemar test

One sample t test
The function t.test() performs different Student's t tests.

• Parameters for the one sample t test are t.test(x,mu,alternative)
• x: numeric vector of values which shall be tested (assumed to follow a normal distribution)
• mu: reference value µ0
• alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than µ0), "greater" (alternative: expectation of x is larger than µ0)

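To see how the three parameters fit together, here is a sketch on made-up values with a hypothetical reference value of 100. The test statistic is also computed by hand as (mean − µ0) / (s / √n) to show what t.test() reports.

```r
# One-sample t test on hypothetical measurements.
x   <- c(102, 95, 110, 98, 105, 99, 101, 108)
mu0 <- 100

res <- t.test(x, mu = mu0, alternative = "greater")

# Manual computation of the statistic and the one-sided p-value.
n        <- length(x)
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_manual <- pt(t_manual, df = n - 1, lower.tail = FALSE)

print(res)
print(c(t = t_manual, p = p_manual))
```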
Blood Sugar Level and Myocardial Infarction

A study was carried out to assess whether the expected blood sugar level (BSL) of patients with myocardial infarction µ is higher than the expected BSL of control individuals, namely µ0=100 mg/100ml.

$$H_0: \mu \le \mu_0 \qquad H_A: \mu > \mu_0$$

One sample t test
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";", dec=",", header=TRUE)

> summary(mi$blood.sugar)
> summary(as.factor(mi$code))

> bloods_infarct=mi$blood.sugar[mi$code==1]
# Attention: two "="s!
# Extracts the blood sugar levels of only the cases.

> summary(bloods_infarct)

One sample t test
> t.test(bloods_infarct,mu=100,alternative="greater")

One Sample t-test

data: bloods_infarct
t = -0.7824, df = 39, p-value = 0.7807
alternative hypothesis: true mean is greater than 100
95 percent confidence interval:
90.14572 Inf
sample estimates:
mean of x
96.875

hands-on example
Load the dataset from the file infarct data.csv into the workspace.

Perform a two-sided one-sample t-test for cholesterol level in infarct patients. The reference value for the population is 180 mg/100ml.
What is the result of the test?

tutorial 7

R
Statistical testing 2

Two sample t test
The function t.test() performs different Student's t tests.

• Parameters for the two sample t test are: t.test(x, y, alternative, var.equal)
• x, y: numeric vectors of values which shall be compared (assumed to follow a normal distribution)
• alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than expectation of y), "greater" (alternative: expectation of x is larger than expectation of y)

Wilcoxon rank sum test
The function wilcox.test() performs the Wilcoxon rank sum test and the Wilcoxon signed rank test.

• Parameters for the Wilcoxon rank sum test are: wilcox.test(x, y, alternative)
• x, y: numeric vectors of values which shall be compared (need not follow a normal distribution)
• alternative: similar to t.test

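Both tests take the two samples as separate vectors. The sketch below runs them side by side on two small made-up samples (no ties, so the Wilcoxon test can compute an exact p-value); all values and names are hypothetical.

```r
# Two hypothetical independent samples.
x <- c(101, 96, 112, 99, 107, 93)
y <- c(98, 91, 104, 95, 100, 89)

# Two-sample t test assuming equal variances (df = n_x + n_y - 2).
tt <- t.test(x, y, alternative = "greater", var.equal = TRUE)

# Wilcoxon rank sum test: no normality assumption.
wt <- wilcox.test(x, y, alternative = "greater")

print(tt$p.value)
print(wt$p.value)
```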
Blood Sugar Level and Myocardial Infarction

A case-control study was carried out to assess whether the expected blood sugar level (BSL) of patients with myocardial infarction µ1 is higher than the expected BSL of control individuals µ2.

$$H_0: \mu_1 \le \mu_2 \qquad H_A: \mu_1 > \mu_2$$

Two sample t test
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";",
  dec=",", header=TRUE)

> summary(mi$blood.sugar)
> summary(as.factor(mi$code))

> bloods_infarct=mi$blood.sugar[mi$code==1]
> bloods_control=mi$blood.sugar[mi$code==0]
# Extracts the blood sugar levels of the cases
# and of the controls.

Two sample t test
> t.test(bloods_infarct, bloods_control,
  var.equal=TRUE, alternative="greater")

Two Sample t-test

data: bloods_infarct and bloods_control
t = 0.0305, df = 78, p-value = 0.4879
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-13.39077 Inf
sample estimates:
mean of x mean of y
96.875 96.625

# Expected BSL of infarct patients is not
# significantly higher than expected BSL of controls.

Wilcoxon rank sum test
> wilcox.test(bloods_infarct, bloods_control,
  alternative="greater")

Wilcoxon rank sum test with continuity correction

data: bloods_infarct and bloods_control
W = 867.5, p-value = 0.2576
alternative hypothesis: true location shift is greater than 0

# The Wilcoxon test can be applied if the BSL does not
# follow a normal distribution. Then the t test is not
# valid.

Pearson's chi-squared test

The function chisq.test() performs a Pearson's chi-squared test for count data.

chisq.test(x)
• x: n x m table (matrix) to be tested

example data: low birth weight

name   text                                 variable type
low    low birth weight of the baby         nominal: 0 'no >=2500g' 1 'yes <2500g'
age    age of mother                        continuous: years
lwt    mother's weight at last period       continuous: pounds
race   ethnicity                            nominal: 1 'white' 2 'black' 3 'other'
smoke  smoking status                       nominal: 0 'no' 1 'yes'
ptl    premature labor                      discrete: number of
ht     hypertension                         nominal: 0 'no' 1 'yes'
ui     presence of uterine irritability     nominal: 0 'no' 1 'yes'
ftv    physician visits in first trimester  discrete: number of
bwt    birthweight of the baby              continuous: g
Pearson's chi-squared test
> library(MASS)
> tab_bw_smok=table(birthwt$low, birthwt$smoke)
> tab_bw_smok

     0  1
  0 86 44
  1 29 30
> chisq.test(tab_bw_smok)

Pearson's Chi-squared test with Yates' continuity correction

data: tab_bw_smok
X-squared = 4.2359, df = 1, p-value = 0.03958

# The probability of having a baby with low birth
# weight is significantly higher for smoking mothers.
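chisq.test() also accepts a plain matrix of counts. The sketch below rebuilds the 2 x 2 table printed on the slide by hand, so it runs without the MASS package; the object names `tab`, `res` and `res_nocorr` are chosen only for this illustration.

```r
# Counts copied from the table on the slide:
# rows = low birth weight (0/1), columns = smoking status (0/1)
tab <- matrix(c(86, 29, 44, 30), nrow = 2,
              dimnames = list(low = c("0", "1"), smoke = c("0", "1")))

# For 2 x 2 tables Yates' continuity correction is applied by default
res <- chisq.test(tab)
res$statistic    # X-squared
res$parameter    # degrees of freedom
res$expected     # expected counts under independence

# The same test without the continuity correction
res_nocorr <- chisq.test(tab, correct = FALSE)
res_nocorr$statistic
```

The element `expected` is useful for checking the usual rule of thumb that expected counts should not be too small.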
hands-on example
loading: library(mlbench),
the dataframe is called PimaIndiansDiabetes2.

Plot a histogram of the variable insulin.

Compare the insulin values between cases and controls
(variable diabetes) using an appropriate test.
tutorial 8

R
Correlation and
linear regression,
low level graphics
Correlation
The function cor(x, y, method) computes the correlation between two
paired random variables.

 x, y: numeric vectors of values for which the correlation shall be
calculated (must have the same length)

 method: "pearson", "spearman" or "kendall"
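A short sketch of the three methods, using the data set `women` that ships with R (height and weight of 15 women) instead of the infarct data; the variable names are chosen only for this illustration.

```r
# Built-in data set 'women', used only for illustration
x <- women$height
y <- women$weight

r_pearson  <- cor(x, y, method = "pearson")   # linear association
r_spearman <- cor(x, y, method = "spearman")  # based on ranks
r_kendall  <- cor(x, y, method = "kendall")   # based on concordant and discordant pairs

c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall)
```

In this data set weight increases strictly with height, so both rank-based coefficients equal 1 while the Pearson coefficient stays slightly below 1.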
Test of correlation
The function cor.test(x, y, alternative, method) tests for correlation
between paired random variables.

 x, y: numeric vectors of values for which the correlation shall be
tested (must have the same length)
 alternative:
"two.sided" (alternative: correlation coefficient ≠ 0, default),
"less" (alternative: negative correlation),
"greater" (alternative: positive correlation)
 method: "pearson", "spearman" or "kendall"
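The effect of the alternative argument can be sketched with the built-in data set `mtcars` (car weight `wt` and fuel efficiency `mpg`), which is used here only as a stand-in for paired measurements.

```r
# Two-sided test: is the correlation different from 0?
res_two  <- cor.test(mtcars$wt, mtcars$mpg,
                     alternative = "two.sided", method = "pearson")

# One-sided test: is the correlation negative?
res_less <- cor.test(mtcars$wt, mtcars$mpg,
                     alternative = "less", method = "pearson")

res_two$estimate   # estimated correlation coefficient
res_two$conf.int   # 95 percent confidence interval
c(two_sided = res_two$p.value, less = res_less$p.value)
```

Because the estimated correlation is negative here, the one-sided p-value for "less" is half of the two-sided p-value.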
Linear regression (simple)

The function lm(formula, data) fits a linear model to data.

 formula: y~x with y response variable and x explanatory variable
(must have the same length)

 data: optional; the dataframe containing x and y, if they are not
fully specified in the formula
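The two ways of passing the variables to lm() can be sketched with the built-in data set `women`; the object names `fit` and `fit2` are chosen only for this illustration.

```r
# Variant 1: column names in the formula, dataframe given by 'data'
fit <- lm(weight ~ height, data = women)

# Variant 2: variables fully specified in the formula, no 'data' argument
fit2 <- lm(women$weight ~ women$height)

coef(fit)                 # intercept and slope
summary(fit)$r.squared    # proportion of explained variance
```

Both variants give the same estimates; only the coefficient labels differ. Variant 1 is usually more convenient, for instance when predict() is applied to new data later on.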
example data: infarct (case/control study)

name         label                values / unit
nro          identifier
grp          group (string)       'control', 'infarct'
code         coded group          0 'control', 1 'infarct'
sex          sex                  1 'male', 2 'female'
age          age                  years
height       body height          cm
weight       body weight          kg
blood sugar  blood sugar level    mg/100ml
diabet       diabetes             0 'no', 1 'yes'
chol         cholesterol level    mg/100ml
trigl        triglyceride level   mg/100ml
cig          cigarettes           number of
Correlation

> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";",
                  dec=",", header=TRUE)

> plot(x=mi$height, y=mi$weight)

> cor(mi$height, mi$weight, method="pearson")
[1] 0.6307697
> cor(mi$height, mi$weight, method="spearman")
[1] 0.6281738
Correlation
> cor.test(mi$height,mi$weight,method="pearson")

Pearson's product-moment correlation

data: mi$height and mi$weight
t = 7.1792, df = 78, p-value = 3.586e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4771865 0.7469643
sample estimates:
      cor
0.6307697

# Significant correlation between body height and
# body weight
Linear regression
> lm(mi$weight~mi$height)

Call:
lm(formula = mi$weight ~ mi$height)

Coefficients:
(Intercept)    mi$height
   -51.2910       0.7477

# Y = a + b × x + E
# with Y: body weight, x: body height,
# a=-51.29, b=0.75
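The fitted equation can be used to compute an expected body weight for a given body height. A small worked example with the coefficients printed above; the height of 180 cm is a hypothetical value chosen for illustration.

```r
# Coefficients copied from the lm() output on the slide
a <- -51.2910   # intercept
b <- 0.7477     # slope: expected change in weight (kg) per cm of height

height_cm <- 180                      # hypothetical body height
predicted_weight <- a + b * height_cm
predicted_weight                      # 83.295 (kg)
```

For a model stored in an object, the same calculation is done by predict() with a newdata dataframe, provided the model was fitted with column names and the data argument.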
graphics
R has extensive graphics facilities. Graphics functions are divided into
 high-level graphics functions
 low-level graphics functions

The quality of the graphs produced by R is often cited as a major
reason for using it in preference to other statistical software systems.
low-level graphics

Plots produced by high-level graphics facilities can be modified by
low-level graphics commands.
low-level functions

name                     function
points(x, y)             adds points (the option type= can be used)
lines(x, y)              adds lines (the option type= can be used)
text(x, y, labels, ...)  adds text given by labels at coordinates (x,y);
                         a typical use is: plot(x, y, type="n"); text(x, y, names)
abline(a, b)             draws a line of slope b and intercept a
abline(h=y)              draws a horizontal line at ordinate y
abline(v=x)              draws a vertical line at abscissa x
rect(x1, y1, x2, y2)     draws a rectangle whose left, right, bottom, and top
                         limits are x1, x2, y1, and y2, respectively
polygon(x, y)            draws a polygon with coordinates given by x and y
title( )                 adds a title and optionally a sub-title
low-level functions

> plot(x=mi$height, y=mi$weight)

> abline(a=-51.29, b=0.75, col="blue")
# Adds the regression line to the scatter plot.

> title("Regression of weight and height")

> text(x=185, y=65, labels="Kieler Woche",
       col="green")
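The same sequence of one high-level call followed by low-level calls can be sketched with the built-in data set `women`. Passing the fitted model to abline() instead of typing rounded coefficients is a choice made for this sketch, not something shown on the slide; pdf(NULL) only keeps the example from opening a window and can be dropped in an interactive session.

```r
pdf(NULL)   # null device: nothing is written to disk

fit <- lm(weight ~ height, data = women)

# High-level function: creates the plot
plot(x = women$height, y = women$weight,
     xlab = "height (in)", ylab = "weight (lb)")

# Low-level functions: add elements to the existing plot
abline(fit, col = "blue")                  # regression line taken from the model
abline(h = mean(women$weight), lty = 2)    # horizontal line at the mean weight
points(mean(women$height), mean(women$weight),
       pch = 19, col = "red")              # point at the two means
title("Regression of weight on height")
text(x = 60, y = 160, labels = "dashed: mean weight")

invisible(dev.off())
```

Low-level functions fail with an error if no plot has been created yet, so a high-level call such as plot() always has to come first.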
low-level functions

[Figure: scatter plot of weight against height with the regression line and the text "Kieler Woche"]
hands-on example
Load the dataset from the file correlation.csv into the workspace.

Calculate the Pearson correlation coefficient between the variables
x and y and test whether this coefficient is significantly different
from 0.
Generate a scatter plot.
tutorial 9

R
Regression
models
Linear regression (simple)

The function lm(formula, data) fits a linear model to data.

 formula: y~x with y response variable and x explanatory variable
(must have the same length)

 data: optional; the dataframe containing x and y, if they are not
fully specified in the formula
Linear regression (multiple)

The function lm(formula, data) fits a linear model to data.

 formula: y~x1+x2+…+xk with y response variable and x1,…,xk
explanatory variables (must have the same length)

 data: optional; the dataframe containing x1,…,xk and y, if they are
not fully specified in the formula
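A minimal sketch of a multiple regression, using the built-in data set `mtcars` as a stand-in; the choice of `mpg` as response and `wt`, `hp` and `qsec` as explanatory variables is arbitrary and serves only as an illustration.

```r
# Response mpg, three explanatory variables
fit_multi <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Table with estimate, standard error, t value and p-value
# for the intercept and each explanatory variable
coef_table <- summary(fit_multi)$coefficients
coef_table

# Adjusted R-squared of the model
summary(fit_multi)$adj.r.squared
```

Each row of the coefficient table tests whether the corresponding coefficient differs from 0, given that the other variables are in the model.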
Generalised linear model

The function glm(formula, family) fits a generalised linear model to data.

 formula: y~x1+x2+…+xk with y response variable and x1,…,xk
explanatory variables (must have the same length)

 family: specifies the error distribution and link function;
choose family=binomial for logistic regression
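A logistic regression can be sketched with the built-in data set `mtcars`, where `am` is a 0/1 variable (transmission type) and `wt` is the car weight. The data and the value wt = 3 are used only for illustration.

```r
# Logistic regression: binary response am, explanatory variable wt
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

summary(fit_logit)$coefficients   # estimates on the log-odds scale
exp(coef(fit_logit))              # odds ratios

# Estimated probability of am = 1 for a hypothetical car with wt = 3
p_hat <- predict(fit_logit, newdata = data.frame(wt = 3), type = "response")
p_hat
```

family=binomial uses the logit link by default, which is why exponentiated coefficients can be read as odds ratios.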
example data: infarct (case/control study)

name         label                values / unit
nro          identifier
grp          group (string)       'control', 'infarct'
code         coded group          0 'control', 1 'infarct'
sex          sex                  1 'male', 2 'female'
age          age                  years
height       body height          cm
weight       body weight          kg
blood sugar  blood sugar level    mg/100ml
diabet       diabetes             0 'no', 1 'yes'
chol         cholesterol level    mg/100ml
trigl        triglyceride level   mg/100ml
cig          cigarettes           number of
Generalised linear model

> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";",
                  dec=",", header=TRUE)

> model_mi=glm(mi$code~mi$sex+mi$age+
               mi$height+mi$weight+mi$blood.sugar+mi$diabet+
               mi$chol+mi$trigl+mi$cig,family=binomial)

> summary(model_mi)
Generalised linear model
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    -34.60297   12.51757  -2.764 0.005704 **
mi$sex           0.23048    0.90885   0.254 0.799810
mi$age           0.10734    0.04161   2.580 0.009883 **
mi$height        0.14930    0.07838   1.905 0.056799 .
mi$weight       -0.11508    0.06304  -1.826 0.067916 .
mi$blood.sugar  -0.02246    0.01399  -1.605 0.108425
mi$diabet        2.05732    2.15947   0.953 0.340743
mi$chol          0.07294    0.02188   3.334 0.000855 ***
mi$trigl        -0.01936    0.01227  -1.578 0.114638
mi$cig           0.07686    0.04695   1.637 0.101603
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Generalised linear model
> model_mi=glm(mi$code~mi$age+mi$chol,family=binomial)
> summary(model_mi)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.13858    3.78005  -4.269 1.96e-05 ***
mi$age        0.08404    0.03255   2.582 0.009827 **
mi$chol       0.05564    0.01569   3.546 0.000391 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Model after backward selection
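The slides do not show how the backward selection was carried out. One possible way is the function step(), which removes variables as long as the AIC of the model improves; this is an assumption of the sketch below and may differ from a selection based on p-values. The built-in data set `infert` serves only as a stand-in for the infarct data.

```r
# Full logistic regression model with four explanatory variables
full <- glm(case ~ age + parity + induced + spontaneous,
            data = infert, family = binomial)

# Backward selection by AIC; trace = 0 suppresses the step-by-step output
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)                          # variables kept in the reduced model
c(full = AIC(full), reduced = AIC(reduced))
```

For step() to work, the model should be fitted with column names and the data argument rather than with `dataframe$column` terms.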
hands-on example
loading: library(mlbench),
the dataframe is called PimaIndiansDiabetes2.

Perform a linear regression with the variable insulin as response and
variables glucose, pressure, mass and triceps as explanatory variables.
Apply a backwards selection to generate a reduced model.
tutorial 10

R
Survival analysis
Survival object

Before performing the analysis, a survival object has to be created
with the function Surv(time, event).

 time:
if event occurred: time of the event
if no event occurred: last observation time since start of study
(survival time)
 event:
1: event
0: no event
Important: has to be numeric.
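A minimal sketch of a survival object; the five observation times and event indicators are hypothetical values made up for illustration.

```r
library(survival)

# Hypothetical data: observation times and event indicators
time  <- c(5, 8, 12, 12, 20)
event <- c(1, 0, 1, 1, 0)    # 1: event, 0: no event (censored)

s <- Surv(time = time, event = event)
s    # censored observations are printed with a trailing "+"
```

The survival object stores time and status together, so it can be used directly as the response in the formulas of survfit() and survdiff().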
Survival curves

The function survfit(formula, data) creates an estimated survival
curve. Afterwards use the plot command.

 formula: Let y be a Surv object.
y~1 for a Kaplan-Meier curve
y~x for several Kaplan-Meier curves stratified by x
 data: optional, if not specified in formula, the dataframe
containing x and y
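Both formula variants can be sketched with the data set `aml` from the survival package (variables `time`, `status` and the group factor `x`); it is used here only as a stand-in for the tutorial data. pdf(NULL) keeps the example from opening a window and can be dropped in an interactive session.

```r
library(survival)

# y~1: one Kaplan-Meier curve for all observations
km_all <- survfit(Surv(time, status) ~ 1, data = aml)

# y~x: one Kaplan-Meier curve per level of x
km_grp <- survfit(Surv(time, status) ~ x, data = aml)

pdf(NULL)   # null device: nothing is written to disk
plot(km_grp, lty = 1:2, xlab = "time", ylab = "survival probability")
legend("topright", legend = levels(aml$x), lty = 1:2)
invisible(dev.off())

summary(km_all)   # estimated survival probabilities at the event times
```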
Log-Rank Test
The function survdiff(formula, rho, data) tests if there is a
difference between two or more survival curves.

 formula: y~x
with y: Surv object
x: group or stratifying variable
 rho: a scalar parameter that controls the type of test.
rho=0 (default) for the Log-Rank Test
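A minimal sketch of the Log-Rank Test, again with the data set `aml` from the survival package as a stand-in. The p-value is recomputed from the returned chi-squared statistic, which shows where the value in the printed output comes from.

```r
library(survival)

# Log-Rank Test for a difference between the two groups in x
lr <- survdiff(Surv(time, status) ~ x, data = aml, rho = 0)
lr

# Chi-squared statistic and p-value with (number of groups - 1) degrees of freedom
lr$chisq
p_value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
p_value
```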
example data: survival

name     label
therapy  two chemotherapies: C1 and C2
time     if death occurred: time of death
         if no death occurred: last observation time
event    1: death
         0: no death
Survival analysis
> install.packages("survival")
> library(survival)
> setwd("C:/Users/Präsentation/MLS")
> cancer=read.table("survival.csv",dec=",",sep=";",
                    header=TRUE)
> head(cancer)

> surv_object=Surv(time=cancer$time,
                   event=cancer$event)
> curve=survfit(surv_object~1)
> summary(curve)
> plot(curve)

# One Kaplan-Meier curve for both therapies combined
# (with confidence bands)
Survival analysis

> surv_object=Surv(time=cancer$time,
                   event=cancer$event)
> curve=survfit(surv_object~cancer$therapy)
> summary(curve)
> plot(curve,lty=1:2)
> legend("topright",levels(factor(cancer$therapy)),lty=1:2)
# factor() is needed if therapy was read in as character:
# levels() of a character vector is NULL.

# Two Kaplan-Meier curves, one for each therapy
# group (without confidence bands)
Survival analysis

> survdiff(surv_object~cancer$therapy,rho=0)
Call:
survdiff(formula = surv_object ~ cancer$therapy, rho = 0)

                   N Obs Expected (O-E)^2/E (O-E)^2/V
cancer$therapy=C1 10   6     4.07     0.919      1.56
cancer$therapy=C2 10   6     7.93     0.471      1.56

Chisq= 1.6 on 1 degrees of freedom, p= 0.211

# No significant difference between the
# survival functions of the two therapies
hands-on example
loading: library(survival),
the dataframe is called retinopathy.

In this dataframe the variable futime is the time variable and the
variable status the event variable.

Plot a Kaplan-Meier curve.
