CS1 R Summary Sheets
Functions
x!	factorial(x)
nCx	choose(n,x)
Γ(x)	gamma(x)
e^x	exp(x)
ln x	log(x)
√x	sqrt(x)
x^n	x^n
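A quick sanity check of these functions at the console (standard values):

```r
factorial(4)    # 4! = 24
choose(5, 2)    # 5C2 = 10
gamma(5)        # Gamma(5) = 4! = 24
log(exp(3))     # log is the natural log, so this returns 3
sqrt(16)        # 4
2^10            # 1024
```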
Statistics
Given a data set stored in a vector x:
n length(x)
Σx	sum(x)
Σx²	sum(x^2)
x̄	mean(x)
median	median(x)
Σ(x − x̄)²	sum((x-mean(x))^2)
sd sd(x)
variance var(x)
Q1 quantile(x,0.25)
Q3 quantile(x,0.75)
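For example, with a small made-up data set:

```r
x <- c(2, 4, 4, 5, 7, 9)
length(x)                   # n = 6
sum(x)                      # 31
mean(x)                     # 31/6
median(x)                   # 4.5
var(x)                      # sample variance (divides by n - 1)
sum((x - mean(x))^2) / (length(x) - 1)   # same as var(x)
quantile(x, 0.25)           # lower quartile
```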
Logical Tests
not equal !=
equal ==
greater than >
greater than or equal >=
or |
and &
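These tests are mainly used to subset vectors, eg:

```r
x <- c(1, 5, 8, 3, 10)
x > 4                 # FALSE TRUE TRUE FALSE TRUE
x[x > 4 & x < 10]     # 5 8
sum(x >= 5)           # counts how many elements are >= 5, here 3
x[x == 3 | x == 10]   # 3 10
```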
Bar Chart
table(<object>)
barplot(<heights>,names=<x values>)
Histogram
hist(<object>)
Additional options:
prob=TRUE histogram of probabilities (or probability densities) instead of frequencies
breaks= specifies where you want the breaks between the bars to be.
useful for discrete data so bars are centred on values
eg breaks=-0.5:5.5 gives breaks at -0.5,0.5,1.5,…,4.5,5.5
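For example, with simulated discrete data (illustrative only), centring the bars on the values 0,1,…,5 and showing probabilities; the returned object holds the breaks and bar heights:

```r
set.seed(1)
x <- rbinom(100, 5, 0.4)                 # made-up discrete data on 0..5
h <- hist(x, breaks = -0.5:5.5, prob = TRUE)
h$breaks        # -0.5 0.5 1.5 2.5 3.5 4.5 5.5
sum(h$density)  # 1, since each bar has width 1
```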
Graphs
plot(x,y) or plot(<object>)
By default will plot points (ie a scattergraph); otherwise specify the type, eg type="l" lines, type="b" both points and lines, type="h" vertical lines, type="s" steps, type="n" no plotting.
Additional options:
main="…" title of graph
xlab="…" x-axis label
ylab="…" y-axis label
pch= plot character (ie what character we use for the point)
lines(x,y) adds line to existing plot, has same options for line types, widths and colours.
points(x,y) adds points to existing plot, has same options for points character and colours.
Plot y=a+bx	abline(a,b) adds a line with intercept a and slope b to an existing plot.
QQ Plots
qqnorm(x) plots quantiles of sample (on vertical) against normal (on horizontal)
qqline(x) adds a diagonal line to qqnorm to show “true” position
qqplot(sim,x) plots quantiles of sample (on vertical) against simulations of theoretical distribution
(on horizontal), eg sim <- rgamma(1000,3,2)
abline(0,1) adds the correct diagonal line to a qqplot
Overview
All of these functions have the form:
<letter><distribution>(<x>, <parameters>)
where:
d<distribution>(<x>, <parameters>) gives the value of the PDF/PF at x, ie f(x) or P(X = x)
p<distribution>(<x>, <parameters>) gives the CDF F(x) = P(X ≤ x)
option: lower=FALSE to give P(X > x)
q<distribution>(<p>, <parameters>) gives the smallest value of x such that F(x) = P(X ≤ x) = p or greater
option: lower=FALSE to give the largest value of x such that P(X > x) = p
r<distribution>(<n>, <parameters>) gives n random simulations from the distribution
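For example, with the binomial distribution:

```r
dbinom(2, size = 10, prob = 0.5)    # P(X = 2)
pbinom(2, 10, 0.5)                  # P(X <= 2)
pbinom(2, 10, 0.5, lower = FALSE)   # P(X > 2)
qbinom(0.5, 10, 0.5)                # smallest x with F(x) >= 0.5, here 5
rbinom(3, 10, 0.5)                  # 3 simulated values
```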
Discrete Distributions
The names and parameters for discrete distributions are:
Binomial binom(…, <n>,<p>)
Poisson pois(…, <mu>)
Type 2 geometric geom(…, <p>)
Type 2 Negative Binomial nbinom(…, <k>,<p>)
Hypergeometric
hyper(…,<success in pop>, <failure in pop>,<sample size>)
Continuous Distributions
The names and parameters for continuous distributions are:
Exponential exp(…, <lambda>)
Gamma gamma(…, <alpha>,<lambda>)
Chi square chisq(…, <dof>)
Uniform unif(…, <min>,<max>)
Beta beta(…, <alpha>,<beta>)
Normal norm(…, <mean>,<sd>)
Log normal lnorm(…, <mu>,<sigma>)
t t(…, <dof>)
F f(…, <dof1>,<dof2>)
X̄ ∼ N(μ, σ²/n) = N(5, 10/20) = N(5, 0.5)
For this particular seed value we get a mean of 4.9899 and variance of 0.5091171.
We can plot a histogram (with height up to 0.6) of the probabilities and superimpose (in red) the empirical
density function:
hist(xbar,ylim=c(0,0.6),prob=TRUE)
lines(density(xbar),col="red")
qqnorm(xbar)
qqline(xbar)
We can calculate probabilities, say P( X ≤ 4) , empirically and using the Central Limit Theorem:
length(xbar[xbar<=4])/length(xbar)
pnorm(4,5,sqrt(0.5))
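The xbar vector above is assumed to hold simulated sample means. A sketch of how it could be generated (the sample size 20, the N(5, 10) population and the seed are assumptions here, chosen so that the variance of the mean is 10/20 = 0.5):

```r
set.seed(53)   # hypothetical seed
xbar <- replicate(1000, mean(rnorm(20, mean = 5, sd = sqrt(10))))
mean(xbar)     # close to 5
var(xbar)      # close to 10/20 = 0.5
# Empirical P(Xbar <= 4) vs the CLT approximation:
length(xbar[xbar <= 4]) / length(xbar)
pnorm(4, 5, sqrt(0.5))
```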
Interpreting the qqnorm plot:
#Close to normal in the middle and fairly good in the upper tail
#However, the 'banana shape' indicates skewness
#Sample quantiles are above the line in both tails - they need to be lower to match the normal
#Seriously lighter lower tail - values are not as low as they should be
#Slightly heavier upper tail - values are higher than expected
#A lighter lower tail and heavier upper tail indicate positive skew
E(X | λ) = λ and var(X | λ) = λ
Hence:
x <- rep(0,10000)
set.seed(24)
for (i in 1:10000)
{lambda <- rgamma(1,5,2)
x[i] <- rpois(1,lambda)}
mean(x)
var(x)
For this particular seed value we get a mean of 2.4955 and variance of 3.733353. These are close to the theoretical values E(X) = E(λ) = 5/2 = 2.5 and var(X) = E(λ) + var(λ) = 5/2 + 5/4 = 3.75.
λ | X ∼ gamma(5 + Σxᵢ, 2 + n), so the posterior mean is (5 + Σxᵢ)/(2 + n).
pm <- rep(0,10000)
set.seed(24)
for (i in 1:10000)
{lambda <- rgamma(1,5,2)
x <- rpois(10,lambda)
pm[i] <- (5+sum(x))/(2+10)}
mean(pm)
λ | X ∼ gamma(5 + Σxᵢ, 2 + n)
The posterior mean can be written in credibility form:
(5 + Σxᵢ)/(2 + n) = n/(2 + n) × (Σxᵢ/n) + 2/(2 + n) × 5/2 = Z x̄ + (1 − Z) × 5/2
where:
Z = n/(2 + n)
cp <- rep(0,10000)
set.seed(24)
for (i in 1:10000)
{lambda <- rgamma(1,5,2)
x <- rpois(10,lambda)
Z <- 10/(2+10)
cp[i] <- Z*mean(x)+(1-Z)*(5/2)}
mean(cp)
Note: This will be the same as the posterior mean as they are algebraically equivalent.
Hence, for this particular seed value we also get a mean of 2.508583.
Number of years
n<-ncol(data)
Estimates of E[m(θ)], E[s²(θ)], var[m(θ)]
m <-mean(rowMeans(data))
s <-mean(apply(data,1,var))
v<-var(rowMeans(data))-s/n
Credibility factor
Z<-n/(n+s/v)
Credibility Premiums
Z* rowMeans(data)+(1-Z)*m
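A worked sketch of these steps on a small made-up data set (3 risks over 4 years; all numbers illustrative):

```r
data <- matrix(c(10, 12, 11, 13,
                 20, 22, 21, 23,
                 30, 28, 29, 33), nrow = 3, byrow = TRUE)
n <- ncol(data)
m <- mean(rowMeans(data))          # estimate of E[m(theta)]
s <- mean(apply(data, 1, var))     # estimate of E[s^2(theta)]
v <- var(rowMeans(data)) - s/n     # estimate of var[m(theta)]
Z <- n/(n + s/v)                   # credibility factor
Z * rowMeans(data) + (1 - Z) * m   # credibility premiums, one per risk
```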
Number of years/risks
n<-ncol(data)
N<-nrow(data)
X <- data/volume
Xibar<-rowSums(data)/rowSums(volume)
Pi <-rowSums(volume)
P <-sum(Pi)
Pstar <-sum(Pi*(1-Pi/P))/(N*n-1)
Estimates of E[m(θ)], E[s²(θ)], var[m(θ)]
m<-sum(data)/P
s <-mean(rowSums(volume*(X-Xibar)^2)/(n-1))
v<-(sum(rowSums(volume*(X-m)^2))/(n*N-1)-s)/Pstar
Credibility factor
Zi<-Pi/(Pi+s/v)
Credibility Premiums
Zi*Xibar+(1-Zi)*m
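A worked sketch with made-up claims and volumes (3 risks over 4 years):

```r
data <- matrix(c(10, 13,  9, 11,            # claims (illustrative)
                 40, 36, 44, 41,
                 45, 50, 41, 46), nrow = 3, byrow = TRUE)
volume <- matrix(c(100, 110,  90, 100,      # risk volumes (illustrative)
                   200, 190, 210, 200,
                   150, 160, 140, 150), nrow = 3, byrow = TRUE)
n <- ncol(data); N <- nrow(data)
X <- data/volume
Xibar <- rowSums(data)/rowSums(volume)
Pi <- rowSums(volume); P <- sum(Pi)
Pstar <- sum(Pi*(1 - Pi/P))/(N*n - 1)
m <- sum(data)/P
s <- mean(rowSums(volume*(X - Xibar)^2)/(n - 1))
v <- (sum(rowSums(volume*(X - m)^2))/(n*N - 1) - s)/Pstar
Zi <- Pi/(Pi + s/v)
Zi*Xibar + (1 - Zi)*m     # credibility premiums per unit of risk volume
```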
Overview
All confidence interval/test functions boil down to the same general form:
<test function>(<data>, <null param>, alternative=…, conf.level=…)
where:
<data> can be a vector or data frame, for 2 samples you’ll need 2 vectors
alternative is alternative hypothesis, choose from default "two.sided", or "less", "greater"
conf.level is size of the confidence interval, default = 95%
data is an additional option to specify the data frame where variable names in the formula come from (if
it’s not attached), eg t.test(weight,data=chickwts)
<null param> is mu for mean, ratio for 2 variances, p for binomial and r for Poisson.
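For example, a one-sample t-test on made-up data:

```r
x <- c(4.8, 5.1, 5.4, 4.9, 5.3, 5.0)   # illustrative sample
res <- t.test(x, mu = 5, alternative = "two.sided", conf.level = 0.95)
res$conf.int    # 95% confidence interval for the mean
res$p.value     # p-value for H0: mu = 5
```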
Poisson
CI poisson.test(x, T=1,alternative="two.sided",conf.level=0.95)
Test poisson.test(x,T=1,r=1, alternative="two.sided",conf.level=0.95)
x is number of events
T is time base for events that occurred, default =1
r is the value of the rate (lambda) in the null hypothesis, default = 1
Paired t-test:
t.test(<after>,<before>,mu=0,alt="two.sided",conf=0.95,paired=TRUE)
Binomial
CI prop.test(<successes>,<trials>, conf.level=0.95,correct=TRUE)
x is number of successes (or a vector of successes and trials)
n is number of trials (not needed if included in x)
or matrix of successes and failures (not trials)
CI poisson.test(x,T=1, conf.level=0.95)
x is vector of events
T is vector of time base for events that occurred, default =1
If any expected frequency is < 5 then simulate the p-value rather than use the chi-square approximation:
chisq.test(f,p=exptd,simulate.p.value=TRUE)
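For example, a goodness-of-fit test on made-up frequencies:

```r
f <- c(18, 32, 50)           # observed frequencies (illustrative)
exptd <- c(0.2, 0.3, 0.5)    # null-hypothesis probabilities
chisq.test(f, p = exptd)
# With small expected frequencies, simulate the p-value instead:
res <- chisq.test(f, p = exptd, simulate.p.value = TRUE)
res$p.value
```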
Contingency Table
fisher.test(obs)
For a sample x of size n from a normal distribution, the bootstrap empirical distribution of X̄ is bm:
bm <- rep(0,1000)
set.seed(17)
for(i in 1:1000)
{bm[i]<-mean(rnorm(n,mean=mean(x),sd=sd(x)))}
Or using replicate:
set.seed(17)
bm <- replicate(1000,mean(rnorm(n,mean=mean(x),sd=sd(x))))
95% CI quantile(bm,c(0.025,0.975))
Test – either use equivalence with CIs, compare with critical values or calculate p-value using length.
For a sample x of size n from an unknown distribution, the bootstrap empirical distribution of X̄ is bm:
bm <- rep(0,1000)
set.seed(17)
for(i in 1:1000)
{bm[i]<-mean(sample(x,replace=TRUE))}
Or using replicate:
set.seed(17)
bm <- replicate(1000,mean(sample(x,replace=TRUE)))
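Usage with a concrete made-up sample, finishing with the percentile confidence interval:

```r
x <- c(12, 15, 9, 20, 14, 11, 17, 13)   # illustrative sample
set.seed(17)
bm <- replicate(1000, mean(sample(x, replace = TRUE)))
quantile(bm, c(0.025, 0.975))           # 95% bootstrap CI for the mean
```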
n1 <- length(x1)
index<-1:length(results)
p<-combn(index,n1)
n<-ncol(p)
dif<-rep(0,n)
for (i in 1:n)
{dif[i]<-mean(results[p[,i]])-mean(results[-p[,i]])}
Find p-value:
length(dif[dif>=ObsT])/length(dif)
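A complete run on tiny made-up samples, small enough that all choose(6,3) = 20 group allocations can be enumerated:

```r
x1 <- c(5, 6, 7); x2 <- c(1, 2, 3)       # hypothetical samples
results <- c(x1, x2)
ObsT <- mean(x1) - mean(x2)              # observed test statistic, 4
index <- 1:length(results)
n1 <- length(x1)
p <- combn(index, n1)                    # all 20 possible allocations
n <- ncol(p)
dif <- rep(0, n)
for (i in 1:n)
  {dif[i] <- mean(results[p[,i]]) - mean(results[-p[,i]])}
length(dif[dif >= ObsT])/length(dif)     # one-sided p-value, here 1/20 = 0.05
```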
n1 <- length(x1)
dif<-rep(0,10000)
set.seed(123)
for (i in 1:10000)
{p<-sample(1:length(results),n1)
dif[i]<-mean(results[p])-mean(results[-p])}
Find p-value:
length(dif[dif>=ObsT])/length(dif)
nD <- length(D)
library(gtools)
sign<-c(-1,1)
p<-permutations(2,nD,sign,repeats.allowed=TRUE)
n<-nrow(p)
dif<-rep(0,n)
for (i in 1:n)
{dif[i]<-mean(D*p[i,])}
Find p-value:
length(dif[dif>=ObsD])/length(dif)
nD <- length(D)
dif<-rep(0,10000)
set.seed(123)
for (i in 1:10000)
{p<-sample(c(-1,1),nD,replace=TRUE)
dif[i]<-mean(D*p)}
Find p-value:
length(dif[dif>=ObsD])/length(dif)
Prepare data:
New principal components (efficient orthogonal co-ord syst) based on old components (rotate axes):
P %*% t(W)
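A sketch with prcomp, assuming P denotes the scores (data in the new coordinate system) and W the loadings (prcomp's rotation matrix); the data here is made up:

```r
set.seed(7)
X <- matrix(rnorm(50), ncol = 2)   # illustrative data: 25 obs, 2 variables
pr <- prcomp(X)
W <- pr$rotation                   # columns = principal component loadings
P <- pr$x                          # scores in the new coordinate system
# Scores are the centred data rotated onto the loadings:
max(abs(P - sweep(X, 2, colMeans(X)) %*% W))         # ~0
# Rotating back with t(W) recovers the centred data:
max(abs(sweep(X, 2, colMeans(X)) - P %*% t(W)))      # ~0
```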
plot(<x>,<y>) or plot(<dataframe>)
plot(<dataframe>) or pairs(<dataframe>)
Correlation
cor(<x>,<y>,method="pearson") or cor(<dataframe>,method="pearson")
Correlation Test
For spearman, if < 50 pairs uses the exact distribution, otherwise uses N(0, 1/(n−1)) with a continuity correction.
For kendall, if < 50 pairs uses the exact distribution, otherwise uses N(0, 2(2n+5)/(9n(n−1))) with a continuity correction.
Can specify always exact using option exact=TRUE
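For example, on made-up pairs:

```r
x <- c(1, 3, 5, 7, 9); y <- c(2, 1, 6, 8, 7)   # illustrative pairs
cor(x, y, method = "spearman")                 # rank correlation, here 0.8
res <- cor.test(x, y, method = "kendall", exact = TRUE)
res$estimate    # Kendall's tau, here 0.6
res$p.value
```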
plot(<x>,<y>) or plot(<dataframe>)
lm(<formula>)
All the functions below assume the linear regression model is stored in <model>.
abline(<model>)
Results can be extracted using: $resid , $coef, $df, $sigma, $r.squared, $fstatistic
However coef and df extract more from summary(model) than they do from model.
Use parm to specify which parameters (default = all) to create a confidence interval for.
Specify by number (1=alpha, 2=beta) or by name (eg “claim”) or skip argument and use indexing.
Fitted values
fitted(<model>) or <model>$fitted
points(<x>,fitted(<model>))
Use newdata to give a data frame of explanatory variables to be used to predict response variables.
eg newdata <- data.frame(<x>=30)
Use interval set to "confidence" for the mean response and "prediction" for an individual response.
Predict without newdata specified gives the fitted values (ie it uses existing explanatory variables).
resid(<model>) or <model>$resid
plot(<model>,1)	residuals vs fitted values
plot(<model>,2)	Q-Q plot of residuals
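A worked sketch on made-up data (variable names are illustrative):

```r
claims <- data.frame(x = c(1, 2, 3, 4, 5),
                     y = c(5.1, 6.9, 9.2, 10.8, 13.1))   # illustrative data
model <- lm(y ~ x, data = claims)
coef(model)                       # intercept and slope
confint(model)                    # 95% CIs for both parameters
predict(model, newdata = data.frame(x = 30),
        interval = "prediction")  # individual response at x = 30
summary(model)$r.squared
```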
plot(<dataframe>) or pairs(<dataframe>)
lm(<formula>)
All the functions below assume the linear regression model is stored in <model>.
Results can be extracted using: $resid , $coef, $df, $sigma, $adj.r.sq, $fstatistic
However coef and df extract more from summary(model) than they do from model.
Use parm to specify which parameters (default = all) to create a confidence interval for.
Specify by number (1=alpha, 2=beta) or by name (eg “claim”) or skip argument and use indexing.
Fitted values
fitted(<model>) or <model>$fitted
Use newdata to give a data frame of explanatory variables to be used to predict response variables.
Use interval set to "confidence" for the mean response and "prediction" for an individual response.
Predict without newdata specified gives the fitted values (ie it uses existing explanatory variables).
resid(<model>) or <model>$resid
Updating models
update(<model>,.~.+<new variable>)
Comparing models
Compare adjusted R2
Check all parameters significant
anova(<model1>,<model2>,test="F")
Family and (default) canonical link function and other available link functions:
gaussian	(identity); also log, inverse
binomial	(logit); also probit, cauchit, log, cloglog
poisson	(log); also identity, sqrt
Gamma	(inverse); also identity, log
GLM analysis
Results can be extracted using: $coef, $fitted, $resid, $df.resid, $deviance, $aic
Or by using the functions: coef(<model>), fitted(<model>), resid(<model>)
Results can be extracted using: $resid , $coef, $df, $sigma, $r.squared, $fstatistic
However coef and df extract more from summary(model) than they do from model.
Use parm to specify which parameters (default = all) to create a confidence interval for.
Specify by number (1=alpha, 2=beta) or by name (eg “claim”).
Fitted values
fitted(<model>) or <model>$fitted
Predicted values
predict(<model>,type="response")
Use type set to "link" (default, linear predictor scale) or "response" (original scale).
Residuals
resid(<model>,type="deviance")
Use type set to "response" for raw residuals, "deviance" (default) or "pearson".
Updating models
update(<model>,.~.+<new variable>)
Comparing models
Compare AIC
Check all parameters significant
anova(<model1>,<model2>,test="F")
or anova(<model1>,<model2>,test="Chisq")
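A worked GLM sketch on made-up count data, pulling these pieces together:

```r
dat <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2, 3, 5, 8, 12, 19))      # illustrative claim counts
model <- glm(y ~ x, data = dat, family = poisson) # canonical log link
coef(model)
fitted(model)                        # on the response (count) scale
resid(model, type = "pearson")
model2 <- update(model, . ~ . - x)   # drop x: the null model
anova(model2, model, test = "Chisq")
AIC(model) < AIC(model2)             # TRUE if including x improves the fit
```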