STTN 225 R Summary


STTN 225

R-Notes
Getting help in R:
"?" and the known function
> ?var
"??" and the idea you are looking for
> ??regression
Or just use the built-in help function.

Assigning variables: (Remember R is case sensitive!!)
> X = 768

Data types in R:
Scalars:
Just a 1 x 1 vector
> B = 192.90

Vectors:
Create these with the “c()” function: c for Combine
> D = c(12,34,45)

Matrices:
Create these with the “matrix()” function:
> G = matrix(c(1,2,3,4,5,6), ncol=3, byrow=TRUE)
ncol : number of columns
byrow=TRUE : Filling in by row.
byrow=FALSE : Filling in by column.
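
As a small illustration of the byrow argument, the sketch below fills the same six values both ways. The names vals, by_row and by_col are made up for this example.

```r
# Fill the same six values by row and by column.
vals = c(1, 2, 3, 4, 5, 6)

by_row = matrix(vals, ncol = 3, byrow = TRUE)   # rows are 1 2 3 and 4 5 6
by_col = matrix(vals, ncol = 3, byrow = FALSE)  # columns are 1 2, 3 4 and 5 6

print(by_row)
print(by_col)
```

Both matrices have 2 rows and 3 columns; only the order in which the cells are filled differs.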
List:
Basically a vector that can have anything as its elements.
Create these with the “list()” function:
> I = list(X=X, D=D, G=G)
(Name the elements so that they can be accessed with $.)

Data Frame:
A data frame is used for storing data tables.
It is a list of vectors of equal length.
> L = data.frame(Xdata=c(1,2,3), Ydata=c(22,33,44))

Indexing in R:
Vector indexing:
One index only: [...]
> D[2] The 2nd element
> D[1:2] The 1st through 2nd elements

Matrix indexing:
Two indexes: [..., ...]
> G[2,3] Element in the 2nd row and the 3rd column
> G[2,] All elements in the 2nd row
> G[,3] All elements in the 3rd column

List:
$ indexing and [[...]] indexing:
> I$X
> I[[2]]

Data frames:
All of the above types of indexing:
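> L[2] The 2nd column (as a data frame)
> L[1,2] Element in the 1st row and the 2nd column
> L$Ydata The column Ydata (as a vector)
> L[[2]] The 2nd column (as a vector)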

Logical indexing:
Uses a vector of TRUEs and FALSEs to select elements:
> D[D>20] The elements of D that are larger than 20
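
The indexing rules above can be checked with a short sketch. It rebuilds small versions of D, I and L so that it runs on its own; the list here is given names, which is an assumption needed for $ indexing to work.

```r
D = c(12, 34, 45)
I = list(X = 768, D = D)                       # named list, so $ works
L = data.frame(Xdata = c(1, 2, 3), Ydata = c(22, 33, 44))

big = D[D > 20]                 # logical indexing: keeps 34 and 45
col_df = L[2]                   # single bracket: a data frame with one column
col_vec = L[[2]]                # double bracket: a plain numeric vector
same_column = identical(L$Ydata, L[[2]])   # $ and [[ ]] give the same vector
same_element = identical(I$D, I[[2]])      # same idea for lists
```

The main thing to notice is that [ ] keeps the container type, while [[ ]] and $ extract the element itself.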
Reading Data into R:
Use "read.table" to get a text file into R.
> data1 = read.table("Rdat1.txt", header=TRUE, skip=0)
"Rdat1.txt": file name
header=TRUE: If the text file contains a header
skip=0: The number of lines skipped before reading the data

Use “file.choose()” to open a file picker in R so you can pick a file


> data2 = read.table(file.choose())
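
To try read.table without an existing file, the sketch below first writes a small temporary text file and then reads it back. The file contents and the note line at the top are invented for the example; skip=1 is used to jump over that note line.

```r
# Write a small text file to a temporary location (contents are made up).
tmp = tempfile(fileext = ".txt")
writeLines(c("exported by a hypothetical instrument",
             "Xdata Ydata",
             "1 22",
             "2 33",
             "3 44"), tmp)

# Skip the first line, then treat the next line as the header.
data1 = read.table(tmp, header = TRUE, skip = 1)
print(data1)
```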

Basic operators in R:
+ Addition
- Subtraction
* Multiplication
/ Division
^ or ** Exponent
%*% Matrix multiplication
%/% Integer division
%% Modulus (Remainder from division)
t(⋅) Transpose of a matrix or vector
solve(⋅) Inverse of a square matrix
%in% Determines if one thing is an element of another
: Sequence
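
A few of the less familiar operators are sketched below. The matrix A is an arbitrary invertible example chosen for illustration.

```r
q = 17 %/% 5                  # integer division: 3
r = 17 %% 5                   # remainder: 2

A = matrix(c(2, 1, 1, 3), ncol = 2)
A_inv = solve(A)              # inverse of A
I2 = A %*% A_inv              # matrix product, numerically the identity
elementwise = A * A           # "*" multiplies element by element

is_member = 3 %in% c(1, 2, 3) # TRUE
one_to_five = 1:5             # the sequence 1 2 3 4 5
```

Note the difference between A * A (elementwise) and A %*% A (matrix multiplication).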

Basic mathematical functions in R:


abs(arg) Calculates the absolute value of arg
exp(arg) Calculates 𝑒 to the power of arg
gamma(arg) Evaluates the gamma function in arg
log(arg) Determines the natural log of arg
log10(arg) Determines the log to base 10 of arg
sign(arg) Determines the sign of arg
sqrt(arg) Determines the square root of arg
cos(arg) Determines the cosine of arg
sin(arg) Determines the sine of arg
tan(arg) Determines the tangent of arg
Basic statistical functions in R:
sum(arg) Calculates the sum of the elements of arg
prod(arg) Calculates the product of the elements of arg
mean(arg) Calculates the mean of arg
median(arg) Calculates the median of arg
var(arg) Calculates the variance of arg (or covariance if more than one argument)
sd(arg) Calculates the standard deviation of arg
cor(arg1,arg2) Calculates the correlation between arg1 and arg2
min(arg) Calculates the min of arg
max(arg) Calculates the max of arg
quantile(arg,p) Calculates the quantile of arg at probability p (p between 0 and 1)
sample(arg) Draws a random sample from arg (a random permutation by default)
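
The sketch below applies several of these functions to a small made-up vector x and a linear transformation y of it, mainly to show that quantile takes a probability between 0 and 1 and that var with two arguments returns a covariance.

```r
x = c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data
y = 2 * x + 1                   # exact linear function of x

m = mean(x)                     # 5
v = var(x)                      # sample variance, divisor n - 1
s = sd(x)                       # square root of var(x)
q50 = quantile(x, 0.5)          # p = 0.5 gives the median
cov_xy = var(x, y)              # two arguments: covariance of x and y
r_xy = cor(x, y)                # perfectly linear, so correlation is 1
```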

Basic matrix functions in R:


rbind(arg1,arg2) Appends the row vector arg2 to the matrix arg1
cbind(arg1,arg2) Appends the column vector arg2 to the matrix arg1
t(arg) Transposes the matrix arg
eigen(arg) Calculates the eigenvalues and eigenvectors of arg
diag(arg) Creates a matrix with the vector arg as the diagonal elements
diag(arg) Extracts the diagonal components of matrix arg
solve(arg) Finds the inverse of the square non-singular matrix arg
svd(arg) Determines the singular value decomposition of a matrix arg
chol(arg) Determines the Cholesky decomposition of a square, symmetric, positive definite matrix arg
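
As a rough check of how these functions fit together, the sketch below uses a small symmetric positive definite matrix A (chosen only for illustration) and verifies a few standard identities.

```r
A = matrix(c(4, 2, 2, 3), ncol = 2)   # symmetric, positive definite

R = chol(A)                 # upper triangular factor with t(R) %*% R equal to A
e = eigen(A)                # e$values and e$vectors
trace_A = sum(diag(A))      # diag() on a matrix extracts the diagonal
D2 = diag(c(1, 2))          # diag() on a vector builds a diagonal matrix
A_plus_row = rbind(A, c(9, 9))   # now 3 rows, 2 columns
A_plus_col = cbind(A, c(9, 9))   # now 2 rows, 3 columns
```

The eigenvalues of A should sum to its trace and multiply to its determinant, which gives a quick sanity check on eigen().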
Basic data functions in R:
length(arg) Returns the length of the vector arg (number of elements in the vector)
dim(arg) Returns the dimensions of arg
ncol(arg) Returns the number of columns of arg
nrow(arg) Returns the number of rows of arg
numeric(arg) Creates a vector with arg zeros
names(arg) Returns the names of arg (for lists and data frames)
dimnames(arg) Returns the row and column names of arg (for matrices and data frames)
matrix(arg1,ncol=arg2) Creates a matrix from vector arg1 with arg2 columns
data.frame(arg) Creates a data frame from the matrix arg
array(arg1,arg2) Creates a multidimensional array from vector arg1 with dimensions arg2
complex(real=a,imaginary=b) Creates a complex number with real part a and imaginary part b
Basic Statistical Analysis Functions

Sequence Functions

Order Functions
Query Functions

Other Basic Functions


Probability functions in R:

Distribution               R's code for the distribution   Parameters
Binomial distribution      binom                           size, prob
Geometric distribution     geom                            prob
Poisson distribution       pois                            lambda
Exponential distribution   exp                             lambda (R argument: rate)
Gamma distribution         gamma                           shape, scale
Beta distribution          beta                            shape1, shape2, ncp
Normal distribution        norm                            mu, sigma (R arguments: mean, sd)
Uniform distribution       unif                            a, b (R arguments: min, max)
Chi-squared distribution   chisq                           df
t distribution             t                               df
F distribution             f                               df1, df2
Weibull distribution       weibull                         shape, scale
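
One point worth checking when using these parameters positionally: in R's gamma functions the second positional parameter is the rate, and scale = 1/rate has to be passed by name. The sketch below, with arbitrary illustrative values, shows the difference.

```r
# Second positional argument of dgamma is the rate, not the scale.
positional = dgamma(1, 2, 3)                 # shape = 2, rate = 3
by_rate = dgamma(1, shape = 2, rate = 3)
by_scale = dgamma(1, shape = 2, scale = 3)   # scale = 3 means rate = 1/3

# The exponential distribution is parameterised by its rate (lambda).
exp_density = dexp(1, rate = 2)              # 2 * exp(-2)
```

Naming the arguments (shape=, scale=, rate=) avoids this kind of mix-up.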


Probability functions in R:
All (cumulative) distribution functions start with "p".
All density or mass functions start with "d".
All quantile functions start with "q".
All random number generator functions start with "r".

For example:
Suppose that X ~ N(0, 9²), with distribution function F(x) = P(X < x) and
density function f(x).
To determine 𝐹(2.9) = 𝑃(𝑋 < 2.9) we type:
> pnorm(2.9,0,9)
To determine 𝑓(2.9) we type:
> dnorm(2.9,0,9)
To determine F⁻¹(0.95) we type:
> qnorm(0.95,0,9)
To generate 1000 observations from 𝑋, type:
> rnorm(1000,0,9)
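
The four prefixes can be tied together in a short sketch. Note that the third argument of the norm functions is the standard deviation (9), not the variance (81). The seed is an arbitrary choice made here so that the simulation is reproducible.

```r
p = pnorm(2.9, 0, 9)            # F(2.9)
x_back = qnorm(p, 0, 9)         # the quantile function undoes pnorm: 2.9
peak = dnorm(0, 0, 9)           # density at the mean: 1 / (9 * sqrt(2 * pi))

set.seed(1)                     # arbitrary seed, for reproducibility only
draws = rnorm(1000, 0, 9)
empirical_p = mean(draws <= 2.9)   # should be close to p
```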

Suppose that W ~ Poisson(lambda=1.5) then:

P(W <= 4) is:
> ppois(4,1.5)
P(W < 4) is:
> ppois(3,1.5)
P(W >= 4) is:
> ppois(3,1.5,lower.tail=FALSE)
or
> 1 - ppois(3,1.5)
P(W > 4) is:
> ppois(4,1.5,lower.tail=FALSE)
or
> 1 - ppois(4,1.5)
P(W = 4) is:
> dpois(4,1.5)
F⁻¹(0.5) is:
> qpois(0.5,1.5)
Randomly draw 5 values from W:
> rpois(5,1.5)
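
Because W is discrete, the difference between < and <= matters. The sketch below checks the Poisson statements above against sums of the mass function.

```r
lambda = 1.5

p_le_4 = ppois(4, lambda)                       # P(W <= 4)
p_lt_4 = ppois(3, lambda)                       # P(W < 4) = P(W <= 3)
p_ge_4 = ppois(3, lambda, lower.tail = FALSE)   # P(W >= 4) = P(W > 3)
p_eq_4 = dpois(4, lambda)                       # P(W = 4)
med = qpois(0.5, lambda)                        # smallest w with P(W <= w) >= 0.5

# The same probabilities written as sums of the mass function.
sum_le_4 = sum(dpois(0:4, lambda))
```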
Basic Function writing:
Writing a function:
To write a function simply use the "function()" function.
Assign the function to a variable.
Use "{ }" brackets for grouping commands
Write your code between the "{ }" brackets
Return an answer using either
the "return()" function or
by simply typing the output variable as the last line of the function

> FunctionName = function(input1, input2)
> {
>   …code… (using input1 and input2)
>   outputVariable = list(ans1, ans2)
>   return(outputVariable)
> }

Calling a function:
> FunctionName(2, 7)
(input1 will then be 2, and input2 will then be 7)
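
A concrete version of the template might look like the sketch below. The function CentreSpread and its arguments are hypothetical and only meant to show the pattern of two inputs and a list as output.

```r
# Hypothetical example: centre and spread of a numeric vector.
CentreSpread = function(v, useMedian)
{
  if(useMedian)
  {
    centre = median(v)
  } else
  {
    centre = mean(v)
  }
  outputVariable = list(centre = centre, spread = sd(v))
  return(outputVariable)
}

res = CentreSpread(c(2, 4, 9), FALSE)   # v is c(2, 4, 9), useMedian is FALSE
res$centre                              # access the parts of the answer with $
res$spread
```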

Iteration and “for” loops:


When writing a "for" loop you need to specify:
the index for counting and
a vector that contains the potential values in the loop:

> for(index in vector)
> {
>   …code… (Going to be performed for every index in vector)
> }
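
For instance, a running total can be built with a "for" loop as sketched below. The vector values is made up; the result should match R's built-in cumsum().

```r
values = c(3, 1, 4, 1, 5)            # made-up data
running = numeric(length(values))    # storage for the results
total = 0

for(i in seq_along(values))          # i takes the values 1, 2, ..., 5
{
  total = total + values[i]
  running[i] = total
}

print(running)
```

Creating the result vector with numeric() before the loop, as in the Monte Carlo and bootstrap code later in these notes, avoids growing it one element at a time.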
If statements:
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
!= Not equal to
== Is equal to
& Elementwise AND
| Elementwise OR

Example:
(An "else" must follow the closing "}" on the same line.)
> if(X > 8)
> {
>   print("X larger than 8")
> } else if (X < 8)
> {
>   print("X smaller than 8")
> } else
> {
>   print("otherwise")
> }
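
The sketch below wraps the same kind of comparison in a hypothetical function Compare8 and also shows the elementwise operators on a vector. The function ifelse() is the vectorised counterpart of if/else.

```r
# Hypothetical helper: describe how a single number compares with 8.
Compare8 = function(X)
{
  if(X > 8)
  {
    msg = "X larger than 8"
  } else if(X < 8)
  {
    msg = "X smaller than 8"
  } else
  {
    msg = "X equal to 8"
  }
  return(msg)
}

D = c(12, 34, 45)
in_range = D > 20 & D < 40                  # elementwise AND: FALSE TRUE FALSE
either = D < 20 | D > 40                    # elementwise OR:  TRUE FALSE TRUE
labels = ifelse(D > 20, "big", "small")     # one label per element
```

Note that if() expects a single TRUE or FALSE, whereas & and | and ifelse() work on whole vectors.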

Plots in R:
The basic histogram in R is
> hist(X)

The kernel density estimate:
> plot(density(X))

A histogram with a kernel density estimate overlaid:
> hist(X,freq=FALSE)
> lines(density(X))
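
The reason for freq=FALSE is that the histogram is then drawn on the density scale (total bar area 1), so the kernel density curve is comparable with it. The sketch below uses simulated data as a stand-in for X and, as an assumption made only so that it can run without a screen, draws to a null device.

```r
set.seed(1)                    # arbitrary seed, for reproducibility only
X = rnorm(200, 10, 3)          # simulated stand-in for the data

pdf(NULL)                      # assumption: null device, nothing written to disk
h = hist(X, freq = FALSE)      # bars on the density scale
d = density(X)                 # kernel density estimate
lines(d)                       # overlay the curve on the histogram
invisible(dev.off())

hist_area = sum(h$density * diff(h$breaks))   # total area of the bars
```

In an interactive session the pdf(NULL) and dev.off() lines can simply be left out.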
Monte Carlo
Population is completely known:
In this situation we are able to draw multiple samples from the population.
We then calculate the statistic each time, and see the (sampling) distribution
of the statistic (i.e., how it changes from one sample to the next).

Bootstrap
Population is unknown. In this situation we have only one sample.
We pretend the sample is a "population" and draw multiple samples from this
"population" with replacement.
As before, we then calculate the statistic for each "resample", and estimate
the (sampling) distribution of the statistic.

Monte Carlo

MonteCarlo = function(MC,n)
{
  MCmedian = numeric(MC)
  for(mc in 1:MC)
  {
    x = rnorm(n,10,3)
    MCmedian[mc] = median(x) #Or mean
  }
  hist(MCmedian,freq=FALSE)
  lines(density(MCmedian))
  ans = sd(MCmedian)
  return(ans)
}

MonteCarlo(10000,10)

Bootstrap

Bootstrap = function(X,B,n)
{
  thstar = numeric(B)
  for(b in 1:B)
  {
    xstar = sample(X,n,replace=TRUE)
    thstar[b] = median(xstar) #Or mean
  }
  hist(thstar,freq=FALSE)
  lines(density(thstar))
  ans = sd(thstar)
  return(ans)
}

Bootstrap(X,10000,10)
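
To see the two approaches side by side without the plots, the sketch below estimates the standard error of the median both ways. It assumes a N(10, 3²) population for the Monte Carlo part and treats one simulated sample as the only data available for the bootstrap part. The seed and the number of repetitions are arbitrary choices.

```r
set.seed(123)     # arbitrary seed, for reproducibility only
n = 10
reps = 2000

# Monte Carlo: population known, draw a fresh sample every time.
mc_medians = numeric(reps)
for(mc in 1:reps)
{
  mc_medians[mc] = median(rnorm(n, 10, 3))
}
mc_se = sd(mc_medians)

# Bootstrap: only one sample X, resample it with replacement.
X = rnorm(n, 10, 3)
boot_medians = numeric(reps)
for(b in 1:reps)
{
  boot_medians[b] = median(sample(X, n, replace = TRUE))
}
boot_se = sd(boot_medians)
```

The bootstrap value depends on the single sample X, so it will generally differ somewhat from the Monte Carlo value.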
Likelihood

The functions we want to maximize are likelihood functions:

lik(θ|X) = f(X_1, …, X_n | θ)

where f(x_1, …, x_n | θ) is a joint density function.

For independent observations this function becomes

lik(θ|X) = f(X_1|θ) × ⋯ × f(X_n|θ) = ∏_{i=1}^{n} f(X_i|θ)

where f(x|θ) is a density function.

The log-likelihood is defined as

ℓ(θ|X) := log(lik(θ|X))

Note: these are functions of θ, not x.
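
As a small numerical illustration, the sketch below evaluates the likelihood and the log-likelihood of three made-up observations under a chi-squared density, with θ playing the role of the degrees of freedom. The data and the θ values are arbitrary.

```r
x = c(1.2, 3.4, 2.2)        # made-up independent observations

lik = function(theta) prod(dchisq(x, theta))            # product of densities
loglik = function(theta) sum(log(dchisq(x, theta)))     # sum of log densities

lik_3 = lik(3)              # likelihood at theta = 3
loglik_3 = loglik(3)        # log-likelihood at theta = 3

# The data stay fixed; only theta changes.
lik_values = sapply(c(1, 2, 3, 4, 5), lik)
```

Taking logs turns the product into a sum, so log(lik(θ)) and the sum of the log densities agree up to rounding.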

Maximum Likelihood Estimation

For a given set of data X_1, …, X_n, we want to find the value of θ that makes
the likelihood function as big as possible (maximum).
The θ value that produces this maximum function value is called the
Maximum Likelihood Estimator (MLE).
This is typically "difficult" to do analytically (i.e., with math).
However, it can be done numerically quite easily by using a "brute force"
approach.
The “Brute Force” Approach

Suppose we want to find the value of θ that produces the largest lik(θ).
Create a "grid" of θ values and evaluate the function at every single value.
The θ that produces the maximum lik(θ) value is your "optimal" θ.
Note: The finer the grid, the better the approximation to the MLE (as long as the grid covers it)!
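
The effect of the grid size can be seen on a toy objective whose maximum is known. In the sketch below the function f, with its maximum at 2.37, is purely hypothetical and stands in for a likelihood.

```r
# Hypothetical objective with a known maximum at theta = 2.37.
f = function(theta) -(theta - 2.37)^2

coarse = seq(0, 5, length.out = 6)      # grid 0, 1, 2, 3, 4, 5
fine = seq(0, 5, length.out = 501)      # grid with step 0.01

coarse_hat = coarse[which.max(f(coarse))]   # best value on the coarse grid
fine_hat = fine[which.max(f(fine))]         # best value on the fine grid

coarse_error = abs(coarse_hat - 2.37)
fine_error = abs(fine_hat - 2.37)
```

The answer can never be more accurate than the grid spacing allows, which is why the coarse grid is off by 0.37 here.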

MLE Algorithm (using 𝑙𝑖𝑘(𝜃|𝑿))

1) Start by setting up a grid of size G of possible values for the parameter
you want to estimate, i.e., grid = {θ_1, θ_2, …, θ_G}.
2) Pick a θ value in this grid, say θ_j.
3) Using the given sample data and the θ_j value chosen in (2), calculate
the density function value at each of the observed data points,
f(X_1|θ_j), …, f(X_n|θ_j).
4) Multiply all these function values together and call the result lik_j.
5) Repeat steps (2) to (4) for each θ_j in the grid (i.e., G times).
6) We now have G likelihood values, lik_1, lik_2, …, lik_G.
7) Now determine which of these lik_j values is the largest. The θ_j value
that corresponds with the largest one is the approximate maximum
likelihood estimator, i.e., θ̂ = θ_{j*}, where j* = argmax_j lik_j.
Sample code using 𝑙𝑖𝑘(𝜃|𝑿)

x = Data                             #Data: the observed sample
G = 100000
grid = seq(0.001,10,length.out = G)  #candidate theta values
lik = numeric(G) #lik = likelihood
i = 1

for(thetaj in grid)
{
  #likj = product(f(Xi|thetaj)); here f is the chi-squared density with df = thetaj
  lik[i] = prod(dchisq(x,thetaj))
  i = i + 1
}

indx = which.max(lik)                #position of the largest likelihood value
theta_hat = grid[indx]

plot(grid,lik,type = "l")
abline(v = grid[indx], lty = 2)
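
To try the likelihood grid search without real data, the sketch below simulates a chi-squared sample with 4 degrees of freedom and uses sapply() in place of the explicit loop. The sample size, grid limits, grid size and seed are all assumptions made for illustration.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
x = rchisq(100, df = 4)              # simulated data, true theta is 4

grid = seq(0.5, 10, length.out = 1000)

# One likelihood value per grid point (same calculation as the loop).
lik = sapply(grid, function(thetaj) prod(dchisq(x, thetaj)))

theta_hat = grid[which.max(lik)]     # grid value with the largest likelihood
print(theta_hat)
```

With 100 observations the estimate should land reasonably close to 4, but it will not equal 4 exactly.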

MLE Algorithm (using 𝓵(𝜃|𝑿))

1) Start by setting up a grid of size G of possible values for the parameter
you want to estimate, i.e., grid = {θ_1, θ_2, …, θ_G}.
2) Pick a θ value in this grid, say θ_j.
3) Using the given sample data and the θ_j value chosen in (2), calculate
the log of the density function value at each of the observed data points,
log[f(X_1|θ_j)], …, log[f(X_n|θ_j)].
4) Add all these function values together and call the result ℓ_j.
5) Repeat steps (2) to (4) for each θ_j in the grid (i.e., G times).
6) We now have G log-likelihood values, ℓ_1, ℓ_2, …, ℓ_G.
7) Now determine which of these ℓ_j values is the largest. The θ_j value that
corresponds with the largest one is the approximate maximum likelihood
estimator, i.e., θ̂ = θ_{j*}, where j* = argmax_j ℓ_j.
Sample code using 𝓵(𝜃|𝑿)

x = Data                             #Data: the observed sample
G = 100000
grid = seq(0.001,10,length.out = G)  #candidate theta values
l = numeric(G) #l = log-likelihood
i = 1

for(thetaj in grid)
{
  #log-likj = sum(log(f(Xi|thetaj)))
  l[i] = sum(log(dchisq(x,thetaj))) ##Note the change: sum of logs instead of product
  i = i + 1
}

indx = which.max(l)                  #position of the largest log-likelihood value
theta_hat = grid[indx]

plot(grid,l,type = "l")
abline(v = grid[indx], lty = 2)
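
One practical reason to prefer the log-likelihood is numerical: a product of many small density values underflows to 0, while the sum of logs stays finite. The sketch below shows this on a simulated sample of 2000 observations and, as an optional cross-check that is not part of the notes' method, compares the grid answer with R's one-dimensional optimiser optimize(). Seed, sample size and grid are assumptions.

```r
set.seed(2)                              # arbitrary seed, for reproducibility only
x = rchisq(2000, df = 4)                 # simulated data, true theta is 4

lik_at_4 = prod(dchisq(x, 4))            # underflows to exactly 0
loglik = function(theta) sum(dchisq(x, theta, log = TRUE))

grid = seq(0.5, 10, length.out = 1000)
l = sapply(grid, loglik)                 # finite for every grid value
theta_grid = grid[which.max(l)]

# Cross-check with a numerical optimiser on the same interval.
opt = optimize(loglik, interval = c(0.5, 10), maximum = TRUE)
theta_opt = opt$maximum
```

The two estimates should agree to within roughly the grid spacing.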
Bootstrap Basic Percentile Confidence Interval

Let X_1, X_2, …, X_n denote i.i.d. random variables from an unknown distribution
F and let θ̂ = θ̂(X_1, X_2, …, X_n) be an estimator for θ.
Sample independently, with replacement, from X_1, X_2, …, X_n, to create B
"bootstrap" samples, X*_1, X*_2, …, X*_n, each of size n.
For each of the B samples calculate the statistic θ̂* = θ̂(X*_1, X*_2, …, X*_n):
(1) X*_1, X*_2, …, X*_n   calculate θ̂*_1
(2) X*_1, X*_2, …, X*_n   calculate θ̂*_2
 ⋮
(B) X*_1, X*_2, …, X*_n   calculate θ̂*_B
Obtain the order statistics θ̂*_(1) ≤ θ̂*_(2) ≤ ⋯ ≤ θ̂*_(B).

A 100(1−α)% basic percentile confidence interval for θ is then given by

[θ̂*_(r) ; θ̂*_(s)], where r = ⌊B(α/2)⌋ and s = ⌊B(1 − α/2)⌋.

Bootstrap Hybrid Percentile Confidence Interval

The bootstrap hybrid percentile method is nearly identical to the basic
percentile method, with the exception of the very last step:
Following the same procedure used in the basic percentile CI, obtain the
order statistics θ̂*_(1) ≤ θ̂*_(2) ≤ ⋯ ≤ θ̂*_(B).

A 100(1−α)% hybrid percentile confidence interval for θ is then given by

[2θ̂ − θ̂*_(s) ; 2θ̂ − θ̂*_(r)], where r = ⌊B(α/2)⌋ and s = ⌊B(1 − α/2)⌋.
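
The index arithmetic and the "reflection" in the hybrid interval are easiest to see with round numbers. In the sketch below the sorted bootstrap statistics are simply taken to be 1, 2, …, 1000 and the original estimate is taken to be 480; both are hypothetical values chosen for illustration.

```r
B = 1000
alpha = 0.05
mstar = 1:B               # hypothetical, already sorted bootstrap statistics
theta_hat = 480           # hypothetical estimate from the original sample

r = floor(B * alpha / 2)          # 25
s = floor(B * (1 - alpha / 2))    # 975

basicCI = c(mstar[r], mstar[s])                               # 25 and 975
hybridCI = c(2 * theta_hat - mstar[s], 2 * theta_hat - mstar[r])   # -15 and 935
```

Both intervals have the same width; the hybrid interval is the basic one reflected around θ̂, so the upper order statistic determines the lower limit and vice versa.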
Sample code

BootCI = function(x,alpha,B)
{
  n = length(x)
  mstar = numeric(B)
  for(b in 1:B)
  {
    xstar = sample(x,n,replace=TRUE)  #resample with replacement
    mstar[b] = median(xstar) #or mean
  }
  mstar = sort(mstar)  #order statistics of the bootstrap medians

  r = floor(B*alpha/2)  #needs B*alpha/2 >= 1, otherwise mstar[r] is empty
  s = floor(B*(1 - alpha/2))

  basicCI = c(mstar[r],mstar[s])
  hybridCI = c(2*median(x)-mstar[s],2*median(x)-mstar[r])

  ans = list(basicCI=basicCI, hybridCI=hybridCI)
  return(ans)
}

X = Data
BootCI(X, 0.05, 1000)
