Module 3 R Data Science
R
Introduction and descriptive statistics
what is R
R is open source.
R is an object-oriented programming language.
Everything in R is an object.
how to get R
Homepage and CRAN (Comprehensive R Archive Network):
http://www.r-project.org/
how to edit R
Editor: RStudio
http://www.rstudio.com
how R works
using R as a calculator
Users type expressions to the R interpreter.
arithmetic operators
+ addition
- subtraction
* multiplication
/ division
^ raise to power
Arithmetic results in numeric value(s).
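A minimal sketch of using R as a calculator with the operators listed above. The variable names are chosen only for illustration; at the prompt the expressions can also be typed directly without storing the result.

```r
# Each arithmetic expression is evaluated and yields a numeric value.
sum_result  <- 7 + 5   # addition
diff_result <- 7 - 5   # subtraction
prod_result <- 7 * 5   # multiplication
quot_result <- 7 / 5   # division
pow_result  <- 2^3     # raise to power

# The operators also work element by element on vectors.
vec_result <- c(1, 2, 3) * 2

print(c(sum_result, diff_result, prod_result, quot_result, pow_result))
print(vec_result)
```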
logical operators
type operator action performed
> 4 < 3
[1] FALSE
> 2^3 == 9
[1] FALSE
> (3 + 1) != 3
[1] TRUE
assignment
Values are stored by assigning them a name.
The following statements are equivalent; each assigns the value 17 to the name z:
> z = 17
> z <- 17
> 17 -> z
data types
> a = (1 + 1 == 3) # logical
> a
[1] FALSE
> mode(a)
[1] "logical"
data structures
creating vectors
The function c() can combine several elements into a vector.
> x = c(1, 3, 5, 7, 8, 9) # numerical vector
> x
[1] 1 3 5 7 8 9
> x = c(1, 2, 3, 4)
> c(x, 10)
[1] 1 2 3 4 10
> c(x, x)
[1] 1 2 3 4 1 2 3 4
> 5:-5
[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
> y = 1:11
> y
[1] 1 2 3 4 5 6 7 8 9 10 11
extracting elements
> x = c(1, 3, 5, 7, 8, 9)
> x[3] # extract 3rd position
[1] 5
> x[1:3] # extract positions 1-3
[1] 1 3 5
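Besides single positions and ranges, square brackets also accept negative positions and logical conditions. The following sketch reuses the vector x from above; the result names are illustrative only.

```r
x <- c(1, 3, 5, 7, 8, 9)

without_first <- x[-1]        # a negative position drops that element
first_and_last <- x[c(1, 6)]  # a vector of positions selects several elements
above_five <- x[x > 5]        # a logical condition keeps elements where it is TRUE

print(without_first)
print(first_and_last)
print(above_five)
```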
data frame
In a data frame the column labels are the vector names.
> fix(data)
example data: low birth weight
name   text                                  variable type
low    low birth weight of the baby          nominal: 0 'no >=2500g', 1 'yes <2500g'
age    age of mother                         continuous: years
lwt    mother's weight at last period        continuous: pounds
race   ethnicity                             nominal: 1 'white', 2 'black', 3 'other'
smoke  smoking status                        nominal: 0 'no', 1 'yes'
ptl    premature labor                       discrete: number of
ht     hypertension                          nominal: 0 'no', 1 'yes'
ui     presence of uterine irritability      nominal: 0 'no', 1 'yes'
ftv    physician visits in first trimester   discrete: number of
bwt    birthweight of the baby               continuous: g
The birthweight data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass., during 1986.
example data: low birth weight
loading: library(MASS), the dataframe is called birthwt.
> library(MASS)
> dim(birthwt)
> summary(birthwt)
> head(birthwt)
> str(birthwt)
extracting vectors
> birthwt$age
> birthwt$age[33]
> birthwt$age[1:10]
some functions in R
name                   function
summary(x)             summary statistics of the elements of x
max(x)                 maximum of the elements of x
min(x)                 minimum of the elements of x
sum(x)                 sum of the elements of x
mean(x)                mean of the elements of x
sd(x)                  standard deviation of the elements of x
median(x)              median of the elements of x
quantile(x, probs=…)   quantiles of the elements of x
sort(x)                sorts the elements of x
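The functions in the table can be tried on any numeric vector. The sketch below uses a small made-up vector of ages (not taken from birthwt) so that the results are easy to check by hand.

```r
ages <- c(14, 16, 19, 21, 23, 26, 31, 45)  # made-up values for illustration

age_mean   <- mean(ages)
age_median <- median(ages)
age_sd     <- sd(ages)
# probs= takes the desired quantile probabilities
age_quartiles <- quantile(ages, probs = c(0.25, 0.5, 0.75))
ages_desc <- sort(ages, decreasing = TRUE)

print(summary(ages))
print(age_quartiles)
print(ages_desc)
```

The 50% quantile returned by quantile() is the same value as median(ages).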
> mean(birthwt$age)
[1] 23.2381
> max(birthwt$age)
[1] 45
> min(birthwt$age)
[1] 14
getting help
To get help on the sd() function you can type either of
> help(sd)
> ?sd
sorting vectors
Sorting / ordering of data in vectors with the function sort():
> help(sort)
> x=sort(birthwt$age, decreasing=FALSE)
> x[1:10]
[1] 14 14 14 15 15 15 16 16 16 16
> hist(birthwt$age)
> boxplot(birthwt$age)
hands-on example
loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2.
R
Graphics and probability theory
graphics
R has extensive graphics facilities.
Graphics functions are divided into
high-level graphics functions
low-level graphics functions
plot function
To plot points with x and y coordinates, or two random variables of a data set against each other (one on the x axis, the other on the y axis; called a scatterplot), type:
> a = c(1,2,3,4)
> b = c(4,4,0,5)
> plot(x=a,y=b)
> plot(a,b) # the same
> library(MASS)
> plot(x=birthwt$age,y=birthwt$lwt) # lwt: mother's weight in pounds
> plot(x=birthwt$age[1:10],y=birthwt$lwt[1:10]) # first 10 mothers
Parameters in the plot() function are (see help(plot) and help(par)):
x            x-coordinate(s)
y            y-coordinates (optional, depends on x)
main, sub    title and subtitle
xlab, ylab   axes labels
xlim, ylim   range of values for x and y
type         type of plot
lty          type of lines
pch          plot symbol
cex          scale factor
plot symbol / line type
plot type: type=
"p"   points
"l"   lines
"b"   both
"s"   steps
"h"   vertical lines
"n"   nothing
…
plot symbol: pch=
line type: lty=
> a = seq(-5, +5, by=0.2)
> b = a^2
> plot(a, b)
> plot(a,b,main="quadratic function")
> plot(a,b,main="quadratic function",cex=2)
> plot(a,b,main="quadratic function",col="blue")
> plot(a,b,main="quadratic function",type="l")
> plot(a,b,main="quadratic function",type="b")
> plot(a,b,main="quadratic function",pch=2)
[Figure: plot titled "quadratic function", y plotted against x]
probability theory, factorials
> choose(8,5)
[1] 56
functions for random variables
Distributions can be easily calculated or simulated using R.
The functions are named such that the first letter states what the function calculates or simulates:
d = density function (probability function)
p = distribution function
q = quantile (inverse distribution function)
r = random number generation
The last part of the function name specifies the type of distribution, e.g. binom for the binomial distribution (dbinom, pbinom, qbinom, rbinom).
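The naming scheme can be sketched with the four functions of the normal distribution (dnorm, pnorm, qnorm, rnorm). The seed value is an arbitrary choice that only makes the random numbers reproducible.

```r
# d: density of the N(0,1) distribution at x = 0
d_val <- dnorm(0, mean = 0, sd = 1)

# p: distribution function, P(X <= 1.96)
p_val <- pnorm(1.96, mean = 0, sd = 1)

# q: quantile function, the inverse of the distribution function
q_val <- qnorm(p_val, mean = 0, sd = 1)

# r: random numbers; set.seed() makes the draw reproducible
set.seed(123)
r_vals <- rnorm(5, mean = 0, sd = 1)

print(c(d = d_val, p = p_val, q = q_val))
print(r_vals)
```

Because q inverts p, q_val recovers the value 1.96 that was passed to pnorm().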
binomial distribution
Probability function:
$$f(k) = P(X = k) = \binom{n}{k}\,\pi^{k}\,(1-\pi)^{n-k}$$
Arguments of dbinom(x, size, prob):
x: k
size: n
prob: π
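As a quick check of the probability function, the value from the formula can be compared with dbinom(). The numbers n = 10, π = 0.3 and k = 3 are arbitrary illustration values, not taken from the slides.

```r
n <- 10         # size: number of trials
prob_pi <- 0.3  # prob: success probability
k <- 3          # x: number of successes

# Probability function written out: choose(n, k) * pi^k * (1 - pi)^(n - k)
by_formula <- choose(n, k) * prob_pi^k * (1 - prob_pi)^(n - k)

# The same value from the built-in function
by_dbinom <- dbinom(x = k, size = n, prob = prob_pi)

# The probabilities of all possible values 0..n sum to 1
all_probs <- dbinom(0:n, size = n, prob = prob_pi)

print(c(formula = by_formula, dbinom = by_dbinom))
print(sum(all_probs))
```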
normal distribution
Density function:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Calculating the probability density function:
> dnorm(x=2, mean=6, sd=2)
[1] 0.02699548
[Figure: density f(x) of the N(6,4) distribution, plotted for x from 0 to 12]
Distribution function:
$$F(b) = \int_{-\infty}^{b} f(x)\,dx$$
[Figure: density f(x) plotted against x, with the point b marked on the x axis]
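The relation between density and distribution function can be checked numerically: integrating the density up to b should give the same value as pnorm(). This is only a sketch; integrate() is a general-purpose numerical integrator and is used here for illustration, with mean 10 and sd 5 as example values.

```r
b <- 13

# Numerically integrate the N(10, 25) density from -Inf to b.
# Extra arguments (mean, sd) are passed on to dnorm().
area <- integrate(dnorm, lower = -Inf, upper = b, mean = 10, sd = 5)

# Distribution function evaluated directly
F_b <- pnorm(q = b, mean = 10, sd = 5)

print(area$value)
print(F_b)
```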
Distribution function of the N(10,25) distribution:
> pnorm(q=13, mean=10, sd=5)
[1] 0.7257469
[Figure: density f(x) of the N(10,25) distribution, with x = 13 marked on the x axis]
binomial distribution
Probability function:
[1] 0.1072481
normal distribution
Plotting the density of a N(5,49) distribution:
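One possible way to draw this density, as a sketch: the code assumes that N(5,49) denotes mean 5 and variance 49 (so sd = 7), in line with the N(10,25) example that uses sd=5. The plotting range of four standard deviations on each side is an arbitrary choice.

```r
mu <- 5
sigma <- 7  # assumption: 49 is the variance, so sd = sqrt(49)

# Grid of x values covering mu +/- 4 standard deviations
x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 201)
dens <- dnorm(x, mean = mu, sd = sigma)

# type="l" joins the points into a curve
plot(x, dens, type = "l", main = "density of N(5,49)",
     xlab = "x", ylab = "f(x)")
```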
hands-on example
loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2.
random variable X distributed according to Bin(4,0.85)?
Calculate all possible values of the probability function of X. Plot the probability function of X with the possible realizations of X on the x axis and the corresponding values of the probability function on the y axis.
tutorial 3
R
Random numbers and factors
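binomial distribution
Generating random realizations:
$$f(k) = P(X = k) = \binom{n}{k}\,\pi^{k}\,(1-\pi)^{n-k}$$
• rbinom(n, size, prob)
n: number of samples to draw
size: n in the formula above (number of trials)
prob: π
output: number of successes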
normal distribution
Generating random realizations:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
• rnorm(n, mean, sd)
n: number of samples to draw
t distribution
Quantiles:
• qt(p, df)
p: quantile probability
df: degrees of freedom
binomial distribution
Generating random realizations:
[1] 7
# Generating 10 samples
[1] 14 10 6 12 8 6 7 10 5 9
normal distribution
> values=rnorm(10, mean=0, sd=1)
> values
 [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774
 [6]  1.71506499  0.46091621 -1.26506123 -0.68685285 -0.44566197
# 10 simulations from a N(0,1) distribution
> mean(values)
[1] 0.07462565
t distribution
Quantiles:
> qt(p=0.95,df=9)
[1] 1.833113
> qt(p=0.95,df=99)
[1] 1.660391
> qnorm(p=0.95,mean=0,sd=1)
[1] 1.644854
> qt(p=0.975,df=99)
[1] 1.984217
# = t_{1-α/2, n-1} for α=0.05, n=100
object classes
All objects in R have a class. The class attribute allows R to treat objects differently (e.g. for summary() or plot()).
Possible classes are:
numeric
logical
character
list
matrix
data.frame
array
factor
The class is shown by the class() function.
factors
• factor(x, levels, labels)
• as.factor(x)
> smoke = c(0, 0, 1, 1, 0, 0, 0, 1, 0, 1)
> summary(smoke)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 0.0 0.4 1.0 1.0
> class(smoke)
[1] "numeric"
> smoke_new=factor(smoke)
> smoke_new
[1] 0 0 1 1 0 0 0 1 0 1
Levels: 0 1
> summary(smoke_new)
0 1
6 4
> class(smoke_new)
[1] "factor"
> smoke_new=factor(smoke,levels=c(0,1))
> smoke_new
[1] 0 0 1 1 0 0 0 1 0 1
Levels: 0 1
> smoke_new=factor(smoke,levels=c(0,1,2))
> smoke_new
[1] 0 0 1 1 0 0 0 1 0 1
Levels: 0 1 2
> summary(smoke_new)
0 1 2
6 4 0
> smoke_new=factor(smoke,levels=c(0,1),
labels=c("no", "yes"))
> smoke_new
[1] no  no  yes yes no  no  no  yes no  yes
Levels: no yes
> summary(smoke_new)
no yes
6 4
> library(MASS)
> summary(birthwt$race)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 1.847 3.000 3.000
> race_new=as.factor(birthwt$race)
> summary(race_new)
1 2 3
96 26 67
> levels(race_new)
[1] "1" "2" "3"
> levels(race_
hands-on example
Sample 20 realizations of a N(0,1) distribution.
R
Reading data from files, frequency tables
normal distribution
Quantiles:
p: quantile probability
> qnorm(p=0.95,mean=0,sd=1)
[1] 1.644854 # = z_{0.95}
> qnorm(p=0.975,mean=0,sd=1)
[1] 1.959964 # = z_{1-α/2} for α=0.05
reading data: working directory
For reading or saving files, a simple file name identifies a file in the working directory. Files in other places can be specified by the path name.
reading data
read.table() has optional arguments which can be used to change its behaviour.
Setting header=TRUE indicates to R that the first row of the data file contains names for each of the columns.
The argument skip= makes it possible to skip the specified number of lines at the top of the file.
The argument sep= can be used to specify a character which separates columns. (Use sep=";" for semicolon-separated csv files such as the example data, and sep="," for comma-separated ones.)
The argument dec= can be used to specify the character used for decimal points.
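The effect of these arguments can be tried without any external file. The sketch below writes a small made-up file to a temporary location (the file content and column names are invented for illustration) and reads it back with header=, skip=, sep= and dec=.

```r
# Write a small semicolon-separated file with decimal commas.
# The first line is a comment-like line that should be skipped.
tmp <- tempfile(fileext = ".csv")
writeLines(c("exported by a hypothetical tool",
             "id;height;weight",
             "1;172,5;68,0",
             "2;181,0;79,5"),
           tmp)

# skip=1 drops the first line, header=TRUE uses the next line as column names,
# sep=";" splits the columns, dec="," turns "172,5" into the number 172.5
dat <- read.table(tmp, header = TRUE, skip = 1, sep = ";", dec = ",")

str(dat)
unlink(tmp)  # remove the temporary file again
```

With a wrong sep= the columns are not split, which is the kind of mistake that produces errors such as "did not have 2 elements".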
example data: infarct (case/control study)
name          label
nro           identifier
grp           group (string)       'control', 'infarct'
code          coded group          0 'control', 1 'infarct'
sex           sex                  1 'male', 2 'female'
age           age                  years
height        body height          cm
weight        body weight          kg
blood sugar   blood sugar level    mg/100ml
diabet        diabetes             0 'no', 1 'yes'
chol          cholesterol level    mg/100ml
trigl         triglyceride level   mg/100ml
cig           cigarettes           number of
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv")
Error in scan(file, what,...: line 2 did not have 2 elements # wrong separator
frequency tables
table(var1, var2) gives a table of the absolute frequencies of all combinations of var1 and var2 (frequency table, cross classification table, contingency table). var1 and var2 must each take only a finite number of values. var1 defines the rows, var2 the columns.
addmargins(table) adds the sums of rows and columns.
prop.table(table) gives the relative frequencies, overall or with respect to rows or columns.
> grp_sex=table(mi$grp,mi$sex)
> grp_sex
1 2
control 25 15
infarct 28 12
> addmargins(grp_sex)
1 2 Sum
control 25 15 40
infarct 28 12 40
Sum 53 27 80
> prop.table(grp_sex)
1 2
control 0.3125 0.1875
infarct 0.3500 0.1500
> prop.table(grp_sex,margin=1)
1 2
control 0.625 0.375
infarct 0.700 0.300 # rows sum to 1
> prop.table(grp_sex,margin=2)
1 2
control 0.4716981 0.5555556
infarct 0.5283019 0.4444444 # columns sum to 1
hands-on example
Load the dataset from the file bdendo.csv into the workspace.
tutorial 5
R
Installing packages, the package "pROC"
R packages
R consists of a base level of functionality together with a set of contributed libraries which provide extended capabilities.
The key idea is that of a package which provides a related set of software components, documentation and data sets.
Packages can be installed into R. Installing into the system-wide library may need administrator rights; without them, R can install into a personal library.
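Before loading a package it can be useful to check whether it is installed and where R looks for packages. This is a small sketch using MASS as the example package, since the slides load it elsewhere; any other package name works the same way.

```r
# Directories in which R searches for (and installs) packages
lib_paths <- .libPaths()
print(lib_paths)

# requireNamespace() returns TRUE if the package is installed,
# without attaching it to the search path
pkg <- "MASS"
pkg_available <- requireNamespace(pkg, quietly = TRUE)

if (pkg_available) {
  library(MASS)  # attach the package so that its functions and data can be used
} else {
  message("Package ", pkg, " is not installed; see install.packages().")
}
```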
pROC – diagnostic testing
Package: pROC
Type: Package
Title: display and analyze ROC curves
Version: 1.7.1
Date: 2014-02-20
Encoding: UTF-8
Depends: R (>= 2.13)
Imports: plyr, utils, methods, Rcpp (>= 0.10.5)
Suggests: microbenchmark, tcltk, MASS, logcondens, doMC, doSNOW
LinkingTo: Rcpp
Author: Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller.
Maintainer: Xavier Robin
installing packages
You can install R packages using the install.packages() command.
> install.packages("pROC")
Installing package(s) into ‘C:/Users/Amke/Documents/R/win-library/2.15’
(as ‘lib’ is unspecified)
downloaded 827 Kb
package ‘pROC’ successfully unpacked and MD5 sums checked
using installed packages
cite packages
Outcomes of a diagnostic study for a dichotomous test result
R
Statistical testing 1
statistical test functions
name              function
t.test( )         Student's t-test
wilcox.test( )    Wilcoxon rank sum test and signed rank test
ks.test( )        Kolmogorov-Smirnov test
chisq.test( )     Pearson's chi-squared test for count data
mcnemar.test( )   McNemar test
One sample t test
The function t.test() performs different Student's t tests.
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";", dec=",", header=TRUE)
> summary(mi$blood.sugar)
> summary(as.factor(mi$code))
> bloods_infarct=mi$blood.sugar[mi$code==1]
# Attention: two "="s!
# Extracts the blood sugar levels of only the cases.
> summary(bloods_infarct)
> t.test(bloods_infarct,mu=100,alternative="greater")
data: bloods_infarct
t = -0.7824, df = 39, p-value = 0.7807
alternative hypothesis: true mean is greater than 100
95 percent confidence interval:
90.14572 Inf
sample estimates:
mean of x
96.875
hands-on example
Load the dataset from the file infarct data.csv into the workspace.
tutorial 7
R
Statistical testing 2
Two sample t test
The function t.test() performs different Student's t tests.
wilcox.test(x, y, alternative)
x, y: numeric vectors of values to be compared (need not follow a normal distribution)
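Both tests can be tried on simulated data when the infarct file is not at hand. The two groups below are made-up samples (means, standard deviation and seed are arbitrary), so the p-values carry no meaning beyond illustrating the calls.

```r
set.seed(42)
# Two simulated groups of 30 values each (made-up parameters)
group_a <- rnorm(30, mean = 100, sd = 15)
group_b <- rnorm(30, mean = 115, sd = 15)

# Two sample t test with equal variances; one-sided alternative:
# "mean of group_b is greater than mean of group_a"
t_res <- t.test(group_b, group_a, var.equal = TRUE, alternative = "greater")

# Wilcoxon rank sum test with the same one-sided alternative;
# it does not assume normally distributed values
w_res <- wilcox.test(group_b, group_a, alternative = "greater")

print(t_res$p.value)
print(w_res$p.value)
```

With var.equal=TRUE the t test uses n1 + n2 - 2 degrees of freedom, here 58.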
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";",
dec=",", header=TRUE)
> summary(mi$blood.sugar)
> summary(as.factor(mi$code))
> bloods_infarct=mi$blood.sugar[mi$code==1]
> bloods_control=mi$blood.sugar[mi$code==0]
> t.test(bloods_infarct, bloods_control,
var.equal=TRUE, alternative="greater")
Pearson's chi-squared test
chisq.test(x)
x: n x m table (matrix) to be tested
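chisq.test() also accepts a matrix of counts typed in by hand. The sketch below enters the 2 x 2 table of low birth weight by smoking status that the slides obtain from birthwt; the object names are illustrative.

```r
# Counts entered column by column: rows = low (0/1), columns = smoke (0/1)
counts <- matrix(c(86, 29, 44, 30), nrow = 2,
                 dimnames = list(low = c("0", "1"), smoke = c("0", "1")))
print(counts)

res <- chisq.test(counts)

print(res$expected)  # expected counts under independence
print(res$p.value)
```

For 2 x 2 tables chisq.test() applies a continuity correction by default; see help(chisq.test) for the correct= argument.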
> library(MASS)
> tab_bw_smok=table(birthwt$low, birthwt$smoke)
> tab_bw_smok
0 1
0 86 44
1 29 30
> chisq.test(tab_bw_smok)
tutorial 8
R
Correlation and linear regression, low level graphics
Correlation
The function cor(x, y, method) computes the correlation between two paired random variables.
Test of correlation
The function cor.test(x, y, alternative, method) tests for correlation between paired random variables.
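A short sketch of both functions on simulated data, together with a simple linear regression via lm(). The height and weight values are generated artificially (all parameters are made up), so the numbers only illustrate the calls.

```r
set.seed(1)
# Simulated paired measurements (made-up parameters)
height <- rnorm(50, mean = 172, sd = 9)             # cm
weight <- -50 + 0.75 * height + rnorm(50, sd = 8)   # kg

# Pearson correlation coefficient
r <- cor(height, weight, method = "pearson")

# Test of the null hypothesis "correlation is zero"
ct <- cor.test(height, weight, alternative = "two.sided", method = "pearson")

# Simple linear regression of weight on height
fit <- lm(weight ~ height)

print(r)
print(ct$p.value)
print(coef(fit))  # intercept a and slope b
```

The estimate reported by cor.test() is the same value as cor(), and the regression slope equals r multiplied by sd(weight)/sd(height).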
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";",
dec=",", header=TRUE)
> cor.test(mi$height,mi$weight,method="pearson")
> lm(mi$weight ~ mi$height)
Call:
lm(formula = mi$weight ~ mi$height)
Coefficients:
(Intercept) mi$height
-51.2910 0.7477
# Y = a + b × x + E
# with Y: body weight, x: body height,
# a=-51.29, b=0.75
low-level functions
name                      function
points(x, y)              adds points (the option type= can be used)
lines(x, y)               adds lines (the option type= can be used)
text(x, y, labels, ...)   adds text given by labels at coordinates (x,y); a typical use is: plot(x, y, type="n"); text(x, y, names)
abline(a, b)              draws a line of slope b and intercept a
abline(h=y)               draws a horizontal line at ordinate y
abline(v=x)               draws a vertical line at abscissa x
rect(x1, y1, x2, y2)      draws a rectangle whose left, right, bottom, and top limits are x1, x2, y1, and y2, respectively
polygon(x, y)             draws a polygon with coordinates given by x and y
title( )                  adds a title and optionally a sub-title
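Low-level functions add to a plot that a high-level function has already opened. The sketch below uses five made-up points: plot() sets up the axes, and the low-level calls then draw the points, a fitted line, a horizontal reference line, labels and a title.

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

# High-level call: type="n" sets up the coordinate system without drawing points
plot(x, y, type = "n")

# Low-level calls add elements to the existing plot
points(x, y, pch = 2)                                 # triangles
abline(a = coef(fit)[1], b = coef(fit)[2], lty = 2)   # dashed regression line
abline(h = mean(y), col = "blue")                     # horizontal line at the mean of y
text(x, y, labels = seq_along(x), pos = 3)            # point numbers above the points
title(main = "low-level additions")
```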
hands-on example
Load the dataset from the file correlation.csv into the workspace.
tutorial 9
R
Regression models
Linear regression (simple)
Linear regression (multiple)
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";",
dec=",", header=TRUE)
> model_mi=glm(mi$code~mi$sex+mi$age+
mi$height+mi$weight+mi$blood.sugar+mi$diabet
+mi$chol+mi$trigl+mi$cig,family=binomial)
> summary(model_mi)
Generalised linear model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -34.60297 12.51757 -2.764 0.005704 **
mi$sex 0.23048 0.90885 0.254 0.799810
mi$age 0.10734 0.04161 2.580 0.009883 **
mi$height 0.14930 0.07838 1.905 0.056799 .
mi$weight -0.11508 0.06304 -1.826 0.067916 .
mi$blood.sugar -0.02246 0.01399 -1.605 0.108425
mi$diabet 2.05732 2.15947 0.953 0.340743
mi$chol 0.07294 0.02188 3.334 0.000855 ***
mi$trigl -0.01936 0.01227 -1.578 0.114638
mi$cig 0.07686 0.04695 1.637 0.101603
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> model_mi=glm(mi$code~mi$age+mi$chol,family=binomial)
> summary(model_mi)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.13858 3.78005 -4.269 1.96e-05 ***
mi$age 0.08404 0.03255 2.582 0.009827 **
mi$chol 0.05564 0.01569 3.546 0.000391 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
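The coefficients of a model fitted with family=binomial are on the log-odds scale; exponentiating them gives odds ratios. As a sketch, the code below fits such a model to the birthwt data from MASS (used because the infarct file may not be available); the choice of age and smoke as predictors is only for illustration.

```r
library(MASS)  # provides the birthwt data frame

# Logistic regression: low birth weight (0/1) explained by age and smoking status.
# data= lets the formula refer to the columns by name.
fit <- glm(low ~ age + smoke, family = binomial, data = birthwt)

# Coefficients are log-odds; exp() converts them to odds ratios
odds_ratios <- exp(coef(fit))

# Fitted probabilities of low birth weight for each mother
predicted <- predict(fit, type = "response")

print(summary(fit)$coefficients)
print(odds_ratios)
print(head(predicted))
```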
hands-on example
loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2.
tutorial 10
R
Survival analysis
Survival object
time:
if event occurred: time of the event
if no event occurred: last observation time since start of study (survival time)
event:
1: event
0: no event
Important: has to be numeric.
Survival curves
formula: Let y be a Surv object.
y~1 for a Kaplan-Meier curve
y~x for several Kaplan-Meier curves stratified by x
data: optional, if not specified in formula, the dataframe containing x and y
Log-Rank Test
The function survdiff(formula, rho, data) tests if there is a difference between two or more survival curves.
formula: y~x
with y: Surv object
x: group or stratifying variable
rho: a scalar parameter that controls the type of test. rho=0 (default) for the Log-Rank Test
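The three steps (survival object, Kaplan-Meier curves, log-rank test) can be sketched on a small made-up data set. The data frame toy and its values are invented for illustration; the survival package must be installed.

```r
library(survival)

# Made-up data: follow-up time, event indicator (1 = event, 0 = no event), group
toy <- data.frame(
  time  = c(5, 8, 12, 14, 20, 21, 3, 6, 7, 9, 11, 15),
  event = c(1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1),
  group = rep(c("A", "B"), each = 6)
)

# y ~ 1: one Kaplan-Meier curve for all observations
km_all <- survfit(Surv(time, event) ~ 1, data = toy)

# y ~ x: one Kaplan-Meier curve per group
km_grp <- survfit(Surv(time, event) ~ group, data = toy)

# Log-rank test (rho = 0) for a difference between the groups
lr <- survdiff(Surv(time, event) ~ group, data = toy, rho = 0)

print(summary(km_all))
plot(km_grp, lty = 1:2)
print(lr)
```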
example data: survival
name
event   1: death, 0: no death
Survival analysis
> install.packages("survival")
> library(survival)
> setwd("C:/Users/Präsentation/MLS")
> cancer=read.table("survival.csv",dec=",",sep=";",
header=TRUE)
> head(cancer)
> surv_object=Surv(time=cancer$time,
event=cancer$event)
> curve=survfit(surv_object~1)
> summary(curve)
> plot(curve)
> curve=survfit(surv_object~cancer$therapy)
> summary(curve)
> plot(curve,lty=1:2)
> legend("topright",levels(factor(cancer$therapy)),lty=1:2) # factor(): therapy may be read in as character
> survdiff(surv_object~cancer$therapy,rho=0)
Call:
survdiff(formula = surv_object ~ cancer$therapy, rho = 0)
                   N Obs Expected (O-E)^2/E (O-E)^2/V
cancer$therapy=C1 10   6     4.07     0.919      1.56
cancer$therapy=C2 10   6     7.93     0.471      1.56