R Tutorial Lecture Notes
Lecture Notes on
Introductory Statistics using R
Walter Bazán-Palomino
School of Economics and Finance
Universidad del Pacífico (University of the Pacific)
This document is designed to give you an introduction to the use of R and RStudio.
These notes also introduce you to basic statistics and inferential statistics using
RStudio. The goal of this document is not to show all the features of RStudio, nor to
replace a standard textbook in statistics, but to complement it. The notes start with
basic commands and plots, followed by "lower-level" and "medium-level" statistics
material.
1. R AND RSTUDIO
R is an open-source (GPL) statistical environment modeled after S and S-Plus. The R project
was started by Robert Gentleman and Ross Ihaka of the Statistics Department of the University
of Auckland in 1995.
R is free software; that alone is a nontrivial feature. Its syntax is easy to learn, it ships
with many built-in statistical functions, and the language allows us to create our own
(user-written) functions. R and RStudio run on UNIX, Windows, and Macintosh.
The command language is a programming language, so students must learn to pay attention
to syntax issues.
"R can do anything you can imagine," and this is hardly an overstatement. With R you can
write functions, do calculations, apply most available statistical techniques, create simple or
complicated graphs, and even write your own library functions. A large user group supports it.
Many research institutes, companies, and universities have migrated to R.
For the purpose of this course, R has excellent statistical facilities. Nearly everything you may
need in terms of statistics has already been programmed and made available in R (either as
part of the main package or as a user-contributed package).
RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-
highlighting editor that supports direct code execution, as well as tools for plotting, history,
debugging and workspace management.
How to install R and RStudio in Windows
You can download the latest version of R from
https://cran.r-project.org/bin/windows/base/
Once the installation is done, you need to download RStudio. RStudio is an Integrated
Development Environment (IDE) that runs on top of R and is friendlier to use.
https://www.rstudio.com/products/rstudio/download/
The console is the "command window." Type 2+2 after the > symbol (which is where the cursor
appears), and press Enter. The spacing in your command is not relevant.
> 2+2
[1] 4
This simple example shows how RStudio (R) works: you type something, press Enter, and
RStudio will carry out your commands. The trick is to type in sensible things, since mistakes
can easily be made. For example, suppose you want to calculate the logarithm of 2 with base 10.
You may type log(2) and receive 0.6931472 as the answer. This is the natural logarithm, so
0.693 is not the correct answer.
> log(2)
[1] 0.6931472
> log10(2)
[1] 0.30103
All variables created in RStudio are stored in a common workspace. To see which variables are
defined in the workspace, you can use the function ls() (short for "list"). Remember that you
cannot omit the parentheses in ls().
> ls()
 [1] "counts"              "h"                   "heading"             "i"                   "n"
 [6] "opts"                "poly.model"          "predicted.intervals" "spline.model"        "x"
[11] "x.eval"              "xfit"                "y"                   "yfit"
If you want to delete some of the objects, use the command rm() (remove), so that
> rm(height, weight) # deletes the variables height and weight
Note that numbers and words can both be included in a vector; words must be placed in plain
quotes (not the smart quotes of MS Word).
If R is expecting the continuation of a command, it prompts with "+". This may be an indication
that something is wrong, for example that parentheses do not match or there is another syntax
error; in that case it may be best to press the Escape key to get out.
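For example, an expression whose parenthesis is still open makes R wait for the rest:

```r
# R treats these two physical lines as one command, because the
# parenthesis is still open after the first line (the console shows "+")
x <- (1 +
      2)
x   # 3
```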
Naming an object x
R is an object-oriented language. Almost everything is an object with a name. Both x = and
x <- are assignment operations that create an R object named x.
> x <- 4
> x
[1] 4
R object names are almost arbitrary, except that they cannot start with numbers or contain
most special symbols. It is not advisable to use common commands as names of R objects (e.g.,
sum, mean, sd, c, sin, cos, pi, exp, etc., described later). Everything in R, including object
names, is case-sensitive. Note that 3x is not a valid name of an R object.
> 3x=1:4
Error: unexpected symbol in "3x"
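Case sensitivity is easy to demonstrate: x and X below are two different objects.

```r
x <- 10   # lower-case x
X <- 20   # upper-case X is a separate object
x + X     # 30
```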
For example, x=5 means 5 is stored under the name x. Also, x <- c(1,2,3,4) defines the
variable x as the vector (1,2,3,4). Alternatively, use x=1:4.
> x=1:4
> x
[1] 1 2 3 4
> y=c(1:3,NA,4);y
[1] 1 2 3 NA 4
> sum(y)
[1] NA
> sum(y,na.rm=TRUE)
[1] 10
The above output shows that sum(y) is NA unless we recognize the presence of NA and
explicitly ask R to remove it before computing the sum (na.rm means remove NAs). The option
na.rm=TRUE is also available for computation of the mean, median, standard deviation, variance,
etc. Less sophisticated software gives an incorrect number of observations and wrong answers
in the presence of NAs.
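The same na.rm=TRUE pattern works for the other summary functions, for example:

```r
y <- c(1, 2, 3, NA, 4)
mean(y)                  # NA: the missing value propagates
mean(y, na.rm = TRUE)    # 2.5, computed from the four observed values
median(y, na.rm = TRUE)  # 2.5
sd(y, na.rm = TRUE)      # standard deviation of 1, 2, 3, 4
```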
R allows us to manipulate objects as a calculator. Note that exp() is a function in R and exp(1)
means e raised to the power 1. Note also that the c() function of R defines a catalog or list of
two or more values; R does not understand a mere list of things without the c() command. The
print command needs c() below because we want to print more than one thing at once.
#First Example
> pi
[1] 3.141593
> exp(1)
[1] 2.718282
> print(c(pi,exp(1))) #prints to screen values of the pi symbol and e symbol
[1] 3.141593 2.718282
#Second Example
> x <- 4; y<- 9
> x+y
[1] 13
#Third Example
> x <- c(1,2,3,4); y<- c(5,6,7,8)
> x+y
[1] 6 8 10 12
> x <- 1:4; y<- 5:8
> x+y
[1] 6 8 10 12
> x*y
[1] 5 12 21 32
> sum(x*y)
[1] 70
> x%*%y
[,1]
[1,] 70
1.2. Scripts
Perhaps you do not want to work with R on a line-by-line basis. For instance, if you have
entered 8 commands over eight lines and realize that you made a mistake, it is better to work
with R scripts instead of retyping all the commands. Such a collection of command lines can be
stored in a file called a script, usually saved with a .R extension.
In RStudio, you can run a line of code of an R script file by placing the cursor anywhere on
that line (while being careful not to highlight any subset of that line) and pressing the
shortcut keys Ctrl+Enter on a Windows keyboard or Command+Enter on a Mac.
You can also run an entire block of code by selecting all lines to be run then pressing the
shortcut keys Ctrl+Enter/Command+Enter. Or, you can run the entire R script by pressing
Ctrl+Alt+R in Windows or Command+Option+R on a Mac.
1.3. Packages
An R installation contains one or more libraries of packages. Some of these packages are part
of the basic installation. Installing a package simply means downloading the package code onto
your personal computer. There are two main ways to install new packages. The first, and most
common, method is to download them from the Comprehensive R Archive Network (CRAN).
CRAN is the central repository for R packages, which currently hosts over 1000 packages for
various purposes. You can even create your own packages. To install a new R package from
CRAN, you can simply run the code install.packages("name"), where "name" is the name of the
package.
Once you have installed a package, it is on your computer. However, just because it is on your
computer does not mean R is ready to use it. If you want to use something, like a function or
dataset, from a package, you always need to load the package into your R session first. To load
a package, you use the library("name") function.
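As a sketch of the install/load cycle (using the stats package, which ships with R, so no download is actually needed):

```r
# install.packages("stats")  # step 1: download from CRAN (once per machine)
library(stats)               # step 2: load into the current session
exists("rnorm")              # TRUE: functions from the package are now visible
```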
R Relational Operators
Operator Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to
R Logical Operators
Operator Description
! Logical NOT
& Element-wise logical AND
&& Logical AND
| Element-wise logical OR
|| Logical OR
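A short example of both operator families; note that & works element-wise on vectors, while && expects single logical values.

```r
a <- c(1, 5, 9)
a > 4                # FALSE  TRUE  TRUE (element-wise comparison)
a != 5               # TRUE  FALSE  TRUE
a > 4 & a < 9        # FALSE  TRUE FALSE (element-wise AND)
(1 < 2) && (3 < 4)   # TRUE (single logical values)
```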
2. DATA TYPE
There is a wide variety of data types, including vectors (numerical, character, and logical),
matrices, strings, factors, arrays, lists, and data frames.
2.1. Vectors
> a <- c(1,2,5.3,6,-2,4) # numeric vector
> a
[1] 1.0 2.0 5.3 6.0 -2.0 4.0
> b <- c("one","two","three") # character vector
> b
[1] "one" "two" "three"
> c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
> c
[1] TRUE TRUE TRUE FALSE TRUE FALSE
> a[c(2,4)] # 2nd and 4th elements of vector
[1] 2 6
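Besides positional indexing, vectors can be indexed with negative positions and logical conditions:

```r
a <- c(1, 2, 5.3, 6, -2, 4)
a[-1]     # drop the first element: 2.0 5.3 6.0 -2.0 4.0
a[a > 3]  # keep elements where the condition is TRUE: 5.3 6.0 4.0
```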
2.2. Matrices
All columns in a matrix must have the same mode (numeric, character, etc.) and the same
length. The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the
matrix should be filled by columns (the default). dimnames provides optional labels for the
columns and rows.
#First Example
#generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
> y
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
#Second Example
> vector4matrix <- c(47,6,213,9)
> vector4matrix
[1] 47 6 213 9
> rownames <- c("R1", "R2")
> colnames <- c("C1", "C2")
> mymatrix <- matrix(vector4matrix, nrow=2, ncol=2, byrow=TRUE,
+ dimnames=list(rownames, colnames))
> mymatrix
C1 C2
R1 47 6
R2 213 9
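Matrix elements are indexed as y[row, column]; leaving one index empty selects a whole row or column:

```r
y <- matrix(1:20, nrow = 5, ncol = 4)  # filled by columns (the default)
y[2, 3]   # row 2, column 3: 12
y[, 1]    # first column: 1 2 3 4 5
y[4, ]    # fourth row: 4 9 14 19
dim(t(y)) # t() transposes, so the dimensions become 4 5
```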
2.3. Strings
A string is specified by using quotes, and you can store strings in a vector.
> a <- "Good"
> a
[1] "Good"
> b <- c("Good","Bye")
> b
[1] "Good" "Bye"
2.4. Factors
Conceptually, factors are variables in R which take on a limited number of different values; such
variables are often referred to as categorical variables. One of the most important uses of
factors is in statistical modeling; since categorical variables enter into statistical models
differently than continuous variables, storing data as factors ensures that the modeling
functions will treat such data correctly.
Factors in R are stored as a vector of integer values with a corresponding set of character
values to use when the factor is displayed. The factor function is used to create a factor.
> data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
> fdata = factor(data)
> fdata
[1] 1 2 2 3 1 2 3 3 1 2 3 3 1
Levels: 1 2 3
> rdata = factor(data,labels=c("I","II","III"))
> rdata
[1] I II II III I II III III I II III III I
Levels: I II III
Factors represent a very efficient way to store character values, because each unique character
value is stored only once, and the data itself is stored as a vector of integers. Because of this,
read.table() will automatically convert character variables to factors unless the as.is (or
stringsAsFactors) argument is specified. (In R 4.0.0 and later, the default changed and character
columns are no longer converted automatically.) See Section Data Management for details.
> mons = c("March","April","January","November","January","September","October",
+ "September","November","August","January","November","November","February",
+ "May","August","July","December","August","August","September","November",
+ "February","April")
> mons = factor(mons)
> table(mons)
mons
    April    August  December  February   January      July     March
        2         4         1         2         3         1         1
      May  November   October September
        1         5         1         3
2.5. Arrays
Arrays are the R data objects which can store data in more than two dimensions. Arrays are
similar to matrices but can have more than two dimensions. See help(array) for details.
> # Create two vectors of different lengths.
> vector1 <- c(1,1,0)
> vector2 <- c(47,48,45,50,51,49)
> # Take these vectors as input to the array.
> array.1 <- array(c(vector1,vector2),dim = c(3,3,2))
> print(array.1)
, , 1

     [,1] [,2] [,3]
[1,]    1   47   50
[2,]    1   48   51
[3,]    0   45   49

, , 2

     [,1] [,2] [,3]
[1,]    1   47   50
[2,]    1   48   51
[3,]    0   45   49
Let's name the rows, columns, and each layer of the array. I am going to call each layer a
"Matrix."
> col.names <- c("COLUMN1","COLUMN2","COLUMN3")
> row.names <- c("ROW1","ROW2","ROW3")
> matrix.names <- c("Matrix1","Matrix2")
> array.2 <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.name
s, col.names, matrix.names))
> print(array.2)
, , Matrix1

     COLUMN1 COLUMN2 COLUMN3
ROW1       1      47      50
ROW2       1      48      51
ROW3       0      45      49

, , Matrix2

     COLUMN1 COLUMN2 COLUMN3
ROW1       1      47      50
ROW2       1      48      51
ROW3       0      45      49
2.6. Lists
Lists provide a way to store a variety of objects of possibly varying modes in a single R object.
> mylist = list(c(1,2,3),"GDP",TRUE,"Inflation",40,c(9,10,11))
> mylist
[[1]]
[1] 1 2 3
[[2]]
[1] "GDP"
[[3]]
[1] TRUE
[[4]]
[1] "Inflation"
[[5]]
[1] 40
[[6]]
[1] 9 10 11
2.7. Data Frames
A data frame stores columns of possibly different modes. The structure of a data frame (here,
students_df, with a factor column subject and a numeric column percentage) can be seen by
using the str() function.
> str(students_df)
'data.frame': 4 obs. of 2 variables:
$ subject : Factor w/ 4 levels "Chemistry","English",..: 2 3 1 4
$ percentage: num 80 100 85 95
The statistical summary and nature of the data can be obtained by applying summary() function.
> summary(students_df)
      subject    percentage
 Chemistry:1   Min. : 80.00
(output truncated)
Recall
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
object # prints the object
ls() # list current objects
rm(object) # delete an object
sum(object) # returns the sum of all the values present in its arguments
seq(object) # generate regular sequence
rep(object) # replicates the values in its arguments
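A few of the commands above in action:

```r
v <- c(10, 20, 30)
length(v)           # 3
class(v)            # "numeric"
seq(2, 10, by = 2)  # 2 4 6 8 10
rep(v, times = 2)   # 10 20 30 10 20 30
rbind(v, v * 2)     # stacks the two vectors as rows of a 2 x 3 matrix
```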
3. FUNCTIONS
In this section, we discuss applying simple functions to the data, such as the mean of a
variable or the mean of a single data subset. Basic functions are available by default. Other functions are
contained in packages that can be attached to a current session as needed, and you can create
your own functions.
R has data structures or objects (vectors, matrices, arrays, and data frames) that you can
operate on through functions that perform statistical analyses and create graphs.
In the following example, I will show you how to use four of the most important R functions for
descriptive statistics: colSums, rowSums, colMeans, and rowMeans.
> set.seed(1234) # Set seed
> data <- data.frame(matrix(round(runif(20, 4, 34)), # Create example data
+ nrow = 4, ncol = 5))
> data
X1 X2 X3 X4 X5
1 7 30 24 12 13
2 23 23 19 32 12
3 22 4 25 13 10
4 23 11 20 29 11
It is important to realize that tapply calculates the mean (or any other function) for subsets of
observations of a variable, whereas lapply and sapply calculate the mean (or any other function)
of one or more variables, using all observations.
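Since tapply is described above but not shown, here is a minimal sketch with made-up data (scores and group are hypothetical names):

```r
scores <- c(80, 90, 70, 85)        # a numeric variable
group  <- c("A", "B", "A", "B")    # a grouping variable of the same length
tapply(scores, group, mean)        # one mean per group: A 75.0, B 87.5
```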
function.name: is the function's name. This can be any valid variable name, but you should
avoid using names that are used elsewhere in R, such as dir, function, plot, etc.
arg1, arg2, arg3: these are the arguments of the function, also called formals. You can write a
function with any number of arguments. These can be any R object: numbers, strings, arrays,
data frames, or even pointers to other functions; anything that is needed for the function
to run.
Some arguments have default values specified, such as arg3 in our example. Arguments
without a default must have a value supplied for the function to run. You do not need to provide
a value for those arguments with a default, as the function will use the default value.
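The generic shape described above can be sketched as follows; function.name, the argument names, and the body are all placeholders:

```r
function.name <- function(arg1, arg2, arg3 = 10) {
  # the body runs every time the function is called
  out <- arg1 + arg2 * arg3
  return(out)  # the value handed back to the caller
}
function.name(1, 2)            # arg3 falls back to its default: 1 + 2*10 = 21
function.name(1, 2, arg3 = 1)  # 1 + 2*1 = 3
```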
The '...' argument: The ..., or ellipsis, element in the function definition allows for other
arguments to be passed into the function, and passed on to another function. This technique
is often used in plotting, but has uses in many other places.
Function body: The function code within the {} brackets is run every time the function is
called. This code might be very long or very short. Ideally functions are short and do just one
thing; problems are rarely too small to benefit from some abstraction. Sometimes a large
function is unavoidable, but usually it can in turn be constructed from a bunch of smaller
functions. More on that below.
Return value: The last line of the code is the value that will be returned by the function. It is not
necessary that a function return anything, for example a function that makes a plot might not
return anything, whereas a function that does a mathematical operation might return a number,
or a list.
Load the function into the R session
For R to be able to execute your function, it first needs to be read into memory. This is just
like loading a library: until you do so, the functions contained within it cannot be called.
There are two methods for loading functions into the memory:
1. Copy the function text and paste it into the console
2. Use the source() function to load your functions from file.
Our recommendation for writing nice R code is that in most cases, you should use the second of
these options. Put your functions into a file with an intuitive name, like plotting-fun.R and save
this file within the R folder in your project.
Examples
Let's start by defining a function fahrenheit_to_celsius that converts temperatures from
Fahrenheit to Celsius:
> fahrenheit_to_celsius <- function(temp_F) {
+ temp_C <- (temp_F - 32) * 5 / 9
+ return(temp_C)
+ }
> fahrenheit_to_celsius(52)
[1] 11.11111
Here is a function to compute the permutations of n things taken k at a time. The formula is
nPk = n! / (n - k)!. Let the function name be permute. Note that the word function must be
present; it represents an algorithm that takes some inputs (arguments) and returns some output,
here denoted out. A function can be a very long command extending over many lines; any long
command in R can be entered on several lines simply by using curly braces, which therefore have
a special meaning in R. The arguments of functions are always in simple parentheses ().
> permute=function(n,k){
+ out=factorial(n)/factorial(n-k)
+ return(out)
+ }
> permute(4,2)
[1] 12
When you write your own function, it is important to check it against a known answer; for
example, permute(4,2) should equal 4!/2! = 12, as obtained above.
4. DATA MANAGEMENT
Now that we have learned matrix algebra, statistical functions, and other basic functions, we
are ready to handle and analyze different datasets.
From Excel to R
library("readxl")
read_excel("Path where your Excel file is stored\\File Name.xlsx",sheet = "Your sheet name")
From Stata to R
library(foreign)
mydata <- read.dta("c:/mydata.dta")
From R to Excel
write.csv(Your Data, file="Path where you'd like to export the Data\\File Name.csv", row.names = FALSE)
library(xlsx)
write.xlsx(mydata, "c:/mydata.xlsx")
From R to Stata
library(foreign)
write.dta(mydata, "c:/mydata.dta")
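The CSV round trip can be tried without any real files by writing to a temporary path (the data frame and its columns here are made up):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
path <- tempfile(fileext = ".csv")             # a throwaway file path
write.csv(df, file = path, row.names = FALSE)  # export
df2 <- read.csv(path)                          # import it back
df2
```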
Suppose d is a 10 x 10 data frame containing some missing values (NAs):
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 10 6 6 2 9 5 NA 6 NA 6
2 10 3 10 7 3 3 NA 4 1 1
3 2 8 10 7 5 NA 7 NA 5 6
4 2 NA 5 10 9 1 1 10 10 10
5 4 6 6 7 9 7 8 10 6 9
6 3 2 8 4 3 7 NA 8 3 5
7 4 6 10 7 1 1 9 4 6 9
8 3 4 4 NA 9 2 9 9 8 10
9 4 3 1 7 2 3 4 5 8 1
10 8 7 6 5 1 7 1 2 7 5
> d[is.na(d)] = 0
> d
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 10 6 6 2 9 5 0 6 0 6
2 10 3 10 7 3 3 0 4 1 1
3 2 8 10 7 5 0 7 0 5 6
4 2 0 5 10 9 1 1 10 10 10
5 4 6 6 7 9 7 8 10 6 9
6 3 2 8 4 3 7 0 8 3 5
7 4 6 10 7 1 1 9 4 6 9
8 3 4 4 0 9 2 9 9 8 10
9 4 3 1 7 2 3 4 5 8 1
10 8 7 6 5 1 7 1 2 7 5
> d[complete.cases(d),] # Remove all of the rows with "NA"
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
5 4 6 6 7 9 7 8 10 6 9
7 4 6 10 7 1 1 9 4 6 9
9 4 3 1 7 2 3 4 5 8 1
10 8 7 6 5 1 7 1 2 7 5
Note: in the section "Loops", we are going to learn more about how to clean data.
By default, the plot function uses open circles (open dots) as plotting characters, but characters
can be selected from about 20 additional symbols. The plotting character is specified with the
pch option in the plot function.
Scatter Plot with Regression Line
> plot(mtcars$wt, mtcars$mpg)
> abline(lm(mpg~wt, data=mtcars))
> title("Regression of MPG on Weight")
Histograms
> hist(mtcars$mpg)
> hist(mtcars$mpg, breaks=12, col="red")
> x <- mtcars$mpg
> h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon", main="Histogram with Normal Curve")
> xfit<-seq(min(x),max(x),length=40)
> yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
> yfit <- yfit*diff(h$mids[1:2])*length(x)
> lines(xfit, yfit, col="blue", lwd=2)
Dot Chart
> dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7,
+ main="Gas Mileage for Car Models",
+ xlab="Miles Per Gallon")
Bar Plot
> # Simple Bar Plot
> counts <- table(mtcars$gear)
> barplot(counts, main="Car Distribution",
+ xlab="Number of Gears")
> # Simple Horizontal Bar Plot with Added Labels
> counts <- table(mtcars$gear)
> barplot(counts, main="Car Distribution", horiz=TRUE,
+ names.arg=c("3 Gears", "4 Gears", "5 Gears"))
> # Stacked Bar Plot with Colors and Legend
> counts <- table(mtcars$vs, mtcars$gear)
> barplot(counts, main="Car Distribution by Gears and VS",
+ xlab="Number of Gears", col=c("darkblue","red"),
+ legend = rownames(counts))
> # Grouped Bar Plot
> counts <- table(mtcars$vs, mtcars$gear)
> barplot(counts, main="Car Distribution by Gears and VS",
+ xlab="Number of Gears", col=c("darkblue","red"),
+ legend = rownames(counts), beside=TRUE)
The main options of the qplot() function in the ggplot2 package are:

alpha: Alpha transparency for overlapping elements, expressed as a fraction between 0 (complete transparency) and 1 (complete opacity).
color, shape, size, fill: Associates the levels of a variable with symbol color, shape, or size. For line plots, color associates levels of a variable with line color. For density and box plots, fill associates fill colors with a variable. Legends are drawn automatically.
data: Specifies a data frame.
facets: Creates a trellis graph by specifying conditioning variables. Its value is expressed as rowvar ~ colvar. To create trellis graphs based on a single conditioning variable, use rowvar ~ . or . ~ colvar.
geom: Specifies the geometric objects that define the graph type. The geom option is expressed as a character vector with one or more entries. geom values include "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".
main, sub: Character vectors specifying the title and subtitle.
method, formula: If geom="smooth", a loess fit line and confidence limits are added by default. When the number of observations is greater than 1,000, a more efficient smoothing algorithm is employed. Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression. The formula parameter gives the form of the fit. For example, to add simple linear regression lines, you'd specify geom="smooth", method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a quadratic fit. Note that the formula uses the letters x and y, not the names of the variables. For method="gam", be sure to load the mgcv package. For method="rlm", load the MASS package.
x, y: Specifies the variables placed on the horizontal and vertical axis. For univariate plots (for example, histograms), omit y.
xlab, ylab: Character vectors specifying horizontal and vertical axis labels.
xlim, ylim: Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively.
> qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"),
+ method="lm", formula=y~x, color=cyl,
+ main="Regression of MPG on Weight",
+ xlab="Weight", ylab="Miles per Gallon")
Customize Graph
Unlike base R graphs, ggplot2 graphs are not affected by many of the options set in the par()
function. They can be modified using the theme() function and by adding graphic parameters
within the qplot() function. For greater control, use ggplot() and other functions provided by
the package. Note that ggplot2 functions can be chained with "+" signs to generate the final plot.
> dat <- data.frame(
+ time = factor(c("Lunch","Dinner"), levels=c("Lunch","Dinner")),
+ total_bill = c(14.89, 17.23)
+ )
> dat
time total_bill
1 Lunch 14.89
2 Dinner 17.23
> ggplot(data=dat, aes(x=time, y=total_bill)) +
+ geom_bar(stat="identity")
> ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
+ geom_bar(stat="identity")
> ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
+ geom_bar(colour="black", stat="identity")
> ggplot(data=dat, aes(x=time, y=total_bill, fill=time)) +
+ geom_bar(colour="black", stat="identity") +
+ guides(fill=FALSE)
4.4. Tables
Categorical data are usually described in the form of tables. This section outlines how you can
create tables from your data and calculate relative frequencies.
The matrix function needs an argument containing the table values as a single vector and also
the number of rows in the argument nrow. By default, the values are entered columnwise; if
rowwise entry is desired, then you need to specify byrow=T.
> caff.marital <- matrix(c(652,1537,598,242,36,46,38,21,218,327,106,67), nrow=3, byrow=T)
> caff.marital
[,1] [,2] [,3] [,4]
[1,] 652 1537 598 242
[2,] 36 46 38 21
[3,] 218 327 106 67
You might also give the number of columns instead of rows using ncol. If exactly one of ncol
and nrow is given, R will compute the other one so that it fits the number of values. If both
ncol and nrow are given and they do not fit the number of values, the values will be "recycled",
which in some (other!) circumstances can be useful. To get readable printouts, you can add row
and column names to the matrices.
> colnames(caff.marital) <- c("0","1-150","151-300",">300")
> rownames(caff.marital) <- c("Married","Prev.married","Single")
> caff.marital
0 1-150 151-300 >300
Married 652 1537 598 242
Prev.married 36 46 38 21
Single 218 327 106 67
There is a "table" class for which special methods exist, and you can convert to that class using
as.table(caff.marital). The table function below returns an object of class "table". For most
elementary purposes, you can use matrices where two-dimensional tables are expected. One
important case where you do need as.table() is when converting a table to a data frame of
counts:
> as.data.frame(as.table(caff.marital))
Var1 Var2 Freq
1 Married 0 652
2 Prev.married 0 36
3 Single 0 218
4 Married 1-150 1537
5 Prev.married 1-150 46
6 Single 1-150 327
7 Married 151-300 598
8 Prev.married 151-300 38
9 Single 151-300 106
10 Married >300 242
11 Prev.married >300 21
12 Single >300 67
Turning now to another example, we are going to create a table from raw data and call it
"smoker." This data set was created only to be used as an example. This is not as direct a
method as might be desired: here we create a matrix of numbers, specify the row and column
names, and then convert it to a table.
> smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
> colnames(smoke) <- c("High","Low","Middle")
> rownames(smoke) <- c("current","former","never")
> smoke <- as.table(smoke)
> smoke
High Low Middle
current 51 43 22
former 92 28 21
never 68 22 9
There are a number of ways to get the marginal distributions using the margin.table() command.
If you just give the command the table it calculates the total number of observations. You can
also calculate the marginal distributions across the rows or columns based on the one optional
argument:
> margin.table(smoke)
[1] 356
> margin.table(smoke,1)
current former never
116 141 99
> margin.table(smoke,2)
High Low Middle
211 93 52
That is a little obtuse, so fortunately, there is a better way to get the proportions using the
prop.table() command. You can specify the proportions with respect to the different marginal
distributions using the optional argument:
> prop.table(smoke)
High Low Middle
current 0.14325843 0.12078652 0.06179775
former 0.25842697 0.07865169 0.05898876
never 0.19101124 0.06179775 0.02528090
> prop.table(smoke,1)
High Low Middle
current 0.43965517 0.37068966 0.18965517
former 0.65248227 0.19858156 0.14893617
never 0.68686869 0.22222222 0.09090909
> prop.table(smoke,2)
High Low Middle
current 0.2417062 0.4623656 0.4230769
former 0.4360190 0.3010753 0.4038462
never 0.3222749 0.2365591 0.1730769
4.5. Linear Regression
Our first analysis is the linear regression. It uses the lm() command.
> reg1=lm(dist~speed,data=cars)
> summary(reg1)
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We are going to plot the data and add the regression line to the plot:
> plot(dist~speed, data=cars)
> # Name the regression object
> reg1=lm(dist~speed,data=cars)
> # Add the regression line to the plot
> abline(reg1)
An lm object is a list of named objects. We can access all of these names via names().
> names(reg1)
There is also a plot method for lm objects that gives the diagnostic information:
> par(mfrow=c(2,2))
> # Plot the objects
> plot(reg1)
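Two of those named components are used constantly: the coefficients (via coef()) and predictions for new data (via predict()):

```r
reg1 <- lm(dist ~ speed, data = cars)  # the cars data ships with R
coef(reg1)                             # intercept and slope, as in summary(reg1)
predict(reg1, newdata = data.frame(speed = c(10, 20)))  # predicted stopping distances
```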
5. STATISTICS
The concepts of randomness and probability are central to statistics. It is an empirical fact that
most experiments and investigations are not perfectly reproducible. The degree of
irreproducibility may vary: Some experiments in physics may yield data that are accurate to
many decimal places, whereas data on biological systems are typically much less reliable.
However, the view of data as something coming from a statistical distribution is vital to
understanding statistical methods. In this section, we outline the basic ideas of probability and
the functions that R has for random sampling and handling of theoretical distributions.
The sample() function draws random samples from a vector. Notice that its default behaviour is
sampling without replacement: the samples will not contain the same element twice, and size
obviously cannot be bigger than the length of the vector to be sampled. If you want sampling
with replacement, then you need to add the argument replace=TRUE.
> sample(c("H","T"), 10, replace=T)
[1] "T" "H" "T" "H" "T" "T" "H" "H" "H" "H"
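Without replace=TRUE, each element can be drawn at most once:

```r
set.seed(123)         # fix the seed so the draw is reproducible
s <- sample(1:10, 5)  # 5 draws without replacement
any(duplicated(s))    # FALSE: no element appears twice
```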
As described in the previous section, R provides a wide range of functions for obtaining
summary statistics: mean, sd, var, min, max, median, range, and quantile.
> data <- data.frame(matrix(round(runif(50, 4, 34)), # Create example data
+ nrow = 10, ncol = 5))
> data
X1 X2 X3 X4 X5
1 7 25 13 18 21
2 23 20 13 12 23
3 22 12 9 13 13
4 23 32 5 19 23
5 30 13 11 9 14
6 23 29 28 27 19
7 4 13 20 10 24
8 11 12 31 12 19
9 24 10 29 34 11
10 19 11 5 28 27
> summary(data)
X1 X2 X3 X4 X5
Min. : 4.0 Min. :10.00 Min. : 5.0 Min. : 9.0 Min. :11.00
1st Qu.:13.0 1st Qu.:12.00 1st Qu.: 9.5 1st Qu.:12.0 1st Qu.:15.25
Median :22.5 Median :13.00 Median :13.0 Median :15.5 Median :20.00
Mean :18.6 Mean :17.70 Mean :16.4 Mean :18.2 Mean :19.40
3rd Qu.:23.0 3rd Qu.:23.75 3rd Qu.:26.0 3rd Qu.:25.0 3rd Qu.:23.00
Max. :30.0 Max. :32.00 Max. :31.0 Max. :34.0 Max. :27.00
The describe() function comes from the Hmisc package, which must be loaded first:
> library(Hmisc)
> describe(data)
data
5 Variables 10 Observations
-----------------------------------------------------------------------------
X1
n missing distinct Info Mean Gmd
10 0 8 0.976 18.6 9.467
Value 4 7 11 19 22 23 24 30
Frequency 1 1 1 1 1 3 1 1
Proportion 0.1 0.1 0.1 0.1 0.1 0.3 0.1 0.1
-----------------------------------------------------------------------------
X2
n missing distinct Info Mean Gmd
10 0 8 0.988 17.7 9.178
Value 10 11 12 13 20 25 29 32
Frequency 1 1 2 2 1 1 1 1
Proportion 0.1 0.1 0.2 0.2 0.1 0.1 0.1 0.1
-----------------------------------------------------------------------------
X3
n missing distinct Info Mean Gmd
10 0 8 0.988 16.4 11.64
Value 5 9 11 13 20 28 29 31
Frequency 2 1 1 2 1 1 1 1
Proportion 0.2 0.1 0.1 0.2 0.1 0.1 0.1 0.1
-----------------------------------------------------------------------------
X4
n missing distinct Info Mean Gmd
10 0 9 0.994 18.2 10.04
Value 9 10 12 13 18 19 27 28 34
Frequency 1 1 2 1 1 1 1 1 1
Proportion 0.1 0.1 0.2 0.1 0.1 0.1 0.1 0.1 0.1
-----------------------------------------------------------------------------
X5
n missing distinct Info Mean Gmd
10 0 8 0.988 19.4 6.222
Value 11 13 14 19 21 23 24 27
Frequency 1 1 1 2 1 2 1 1
Proportion 0.1 0.1 0.1 0.2 0.1 0.2 0.1 0.1
-----------------------------------------------------------------------------
A common application of loops (we are going to learn looping in the next section) is to apply a
function to each element of a set of values or vectors and collect the results in a single
structure. In R this is abstracted by the functions lapply() and sapply(). The former always
returns a list (hence the 'l'), whereas the latter tries to simplify (hence the 's') the result to a
vector or a matrix if possible.
> sapply(data, mean, na.rm=TRUE)
X1 X2 X3 X4 X5
18.6 17.7 16.4 18.2 19.4
> sapply(data, sd, na.rm=TRUE)
X1 X2 X3 X4 X5
8.395766 8.192815 9.924157 8.689713 5.253570
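To see the difference between the two, here is a small sketch on a toy data frame (the values are hypothetical, standing in for data):

```r
# toy data frame standing in for 'data' (hypothetical values)
df <- data.frame(X1 = c(4, 7, 11), X2 = c(10, 12, 13))

lst <- lapply(df, mean)   # lapply() returns a list, one element per column
vec <- sapply(df, mean)   # sapply() simplifies the same result to a named vector
str(lst)
vec
```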
Sometimes you just want to repeat something a number of times but still collect the results as a
vector. Obviously, this makes sense only when the repeated computations actually give different
results, the common case being simulation studies. This can be done using sapply, but there is
a simplified version called replicate(), in which you just have to give a count and the expression
to evaluate.
> replicate(5, mean(data[,1]))
[1] 18.6 18.6 18.6 18.6 18.6
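replicate() becomes genuinely useful when the expression involves randomness, for example when simulating the sampling distribution of the mean (a minimal sketch; the sample size and distribution are arbitrary choices):

```r
set.seed(1)   # for reproducibility
# 1000 means of samples of size 10 from a standard normal
sim <- replicate(1000, mean(rnorm(10)))
mean(sim)     # close to the true mean 0
sd(sim)       # close to 1/sqrt(10), i.e. about 0.316
```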
A similar function, apply(), allows you to apply a function to the rows or columns of a matrix
(or, more generally, over the indices of a multidimensional array).
> apply(data,2,max)
X1 X2 X3 X4 X5
30 32 31 34 27
The second argument is the index (or vector of indices) that defines what the function is applied
to; in this case we get the columnwise maxima.
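Setting the second argument to 1 applies the function rowwise instead; sketched here on a toy matrix, since the result on data itself depends on its contents:

```r
m <- matrix(1:6, nrow = 2)   # columns are (1,2), (3,4), (5,6)
apply(m, 1, sum)             # rowwise sums: 9 and 12
apply(m, 2, sum)             # columnwise sums: 3, 7 and 11
```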
Empirical quantiles may be obtained with the function quantile().
> quantile(data[,1])
0% 25% 50% 75% 100%
4.0 13.0 22.5 23.0 30.0
It is also possible to obtain other quantiles; this is done by adding an argument containing the
desired percentage points.
> pvec <- seq(0,1,0.1)
> pvec
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> quantile(data[,1],pvec)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
4.0 6.7 10.2 16.6 20.8 22.5 23.0 23.0 23.2 24.6 30.0
How can we create random numbers from the uniform density? As described in Section 3, in R "unif"
means uniform, and the prefix:
d means density
p means cumulative probability
q means quantile
r means random numbers from that density
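A minimal sketch of the four prefixes for the standard uniform distribution on [0, 1]:

```r
dunif(0.5)      # density of U(0,1) at 0.5: equal to 1
punif(0.25)     # cumulative probability P(U <= 0.25): 0.25
qunif(0.9)      # 90th percentile: 0.9
set.seed(123)   # seed chosen arbitrarily, for reproducibility
r <- runif(3)   # three random draws from U(0,1)
r
```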
Distributions and their R names
The binomial probability of exactly x successes, when the probability of success in one trial
is p and the number of trials is n, is given by dbinom(). Note that n is also the largest
possible number of successes.
> p=0.5; n=3; x=0:n
> db=dbinom(x,prob=p,size=n);db
[1] 0.125 0.375 0.375 0.125
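As a quick check, the probabilities over all possible outcomes sum to one, and the cumulative version pbinom() gives running totals:

```r
p <- 0.5; n <- 3; x <- 0:n
db <- dbinom(x, prob = p, size = n)
sum(db)                        # all outcomes together have probability 1
pbinom(1, prob = p, size = n)  # P(X <= 1) = 0.125 + 0.375 = 0.5
```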
The Poisson distribution can be described as the limiting case of the binomial distribution
when the number of trials n increases while the expected number of successes λ = np is fixed.
This is useful for describing rare events in large populations. The R name for the Poisson
distribution is `pois', and the prefixes (d, p, q, r) mean the same thing as they did for the
uniform density.
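For example, with an expected count of λ = 1.5 the Poisson probabilities look like this (the value of λ is an arbitrary illustration):

```r
lambda <- 1.5                # expected number of events
dpois(0, lambda)             # P(X = 0) = exp(-lambda)
ppois(2, lambda)             # cumulative probability P(X <= 2)
# limiting-case check: Binomial(n, p = lambda/n) approaches Poisson(lambda)
dbinom(0, size = 1000, prob = lambda/1000)   # close to exp(-lambda)
```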
Turning now to the normal distribution, I will start by generating an artificial data vector
x of 150 normally distributed observations, which is used in the examples throughout this section.
> x <- rnorm(150)
The density function is probably the least used of the four function types in practice, but if,
for instance, you want to draw the well-known bell curve of the normal distribution, it can be
done like this:
> x1 <- seq(-4,4,0.1)
> plot(x1,dnorm(x1),type="l")
The empirical cumulative distribution function is defined as the fraction of data smaller than or
equal to x. That is, if x is the k-th smallest observation, then the proportion k/n of the data is
smaller than or equal to x (7/10 if x is no. 7 of 10).
> n <- length(x)
> plot(sort(x),(1:n)/n,type="s",ylim=c(0,1))
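R also provides the function ecdf(), which builds the same step function directly (a sketch; the seed and sample are arbitrary):

```r
set.seed(1)
x <- rnorm(150)   # artificial sample, as in the text
Fn <- ecdf(x)     # the empirical c.d.f., returned as a step function
Fn(0)             # fraction of observations smaller than or equal to 0
plot(Fn)          # same step plot as above
```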
One purpose of calculating the empirical cumulative distribution function (c.d.f.) is to see
whether data can be assumed normally distributed. For a better assessment, you might plot the
k-th smallest observation against the expected value of the k-th smallest observation out of n in
a standard normal distribution. The point is that in this way you would expect to obtain a straight
line if data come from a normal distribution with any mean and standard deviation. You can do
this plot with the function qqnorm().
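A minimal sketch, assuming x is the normal sample generated earlier in this section:

```r
set.seed(1)
x <- rnorm(150)   # artificial normal sample
qqnorm(x)         # observed quantiles against theoretical normal quantiles
qqline(x)         # reference line; points near it suggest normality
# the coordinates can also be computed without plotting:
qq <- qqnorm(x, plot.it = FALSE)
cor(qq$x, qq$y)   # close to 1 for normally distributed data
```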
Create a dataset and compute the Z-scores. Z-scores are used to calculate the probability of
a score occurring within a normal distribution, and allow us to compare two or more variables
from different normal distributions.
> students = c("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10", "
S11", "S12", "S13", "S14", "S15")
> scores <- c(60, 62, 63, 66, 67, 70, 69, 69, 68, 64, 65, 71, 62, 63, 64)
> newdata <- as.data.frame(cbind(students, scores))
> pop_sd <- sd(scores)*sqrt((length(scores)-1)/(length(scores)))
> pop_mean <- mean(scores)
> Z_scores <- (scores-pop_mean)/pop_sd
> Z_scores
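Note that R's built-in scale() computes z-scores with the sample standard deviation (divisor n - 1), so it differs from the population z-scores above by the factor sqrt((n-1)/n). A sketch of the relation:

```r
scores <- c(60, 62, 63, 66, 67, 70, 69, 69, 68, 64, 65, 71, 62, 63, 64)
z_sample <- as.numeric(scale(scores))          # uses the sample sd (divisor n-1)
pop_sd <- sd(scores) * sqrt((length(scores)-1)/length(scores))
z_pop <- (scores - mean(scores)) / pop_sd      # population z-scores, as in the text
all.equal(z_pop, z_sample * sd(scores)/pop_sd) # the two versions agree up to this factor
```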
Add the Z-scores to the original dataset and export it as a ".csv" or ".txt" file.
> New_ZScoreData <- cbind(newdata,Z_scores)
> write.csv(New_ZScoreData,"New_ZScore.csv")
> write.table(New_ZScoreData, "New_ZScore.txt")
The number of intervals (called "bins") is an important parameter of a histogram because it splits
the range of the data into intervals of equal width. Each observation in the data set is then
placed in the appropriate bin.
> x1 = c(10.2, 11.9, 11.3, 12.2, 12.7, 12.8, 14.3, 14.5, 14.6, 15.9,
+ 14.8, 15.0, 15.5, 13.2, 13.9, 18.5, 18.9, 18.4, 18.9,
+ 19.0, 19.5, 16.1, 16.2, 16.5, 16.8, 16.9, 16.7, 17.3,
+ 20.2, 20.5, 20.9, 20.8, 20.2, 22.5, 22.7, 22.9)
> hist(x1, breaks=7, main="Simulated Data - x1")
> x2 = x1*rnorm(length(x1), mean=1, sd=1)
> hist(x2, breaks=7, main="Simulated Data - x2")
Example
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
pnorm(0) = 0.5 is the area under the standard normal curve to the left of zero. Likewise,
qnorm(0.80) = 0.84 means that 0.84 is the 80th percentile of the standard normal distribution.
rnorm(100) generates 100 random deviates from a standard normal distribution (mean=0, and
standard deviation=1). Each function has parameters specific to that distribution. For example,
rnorm(100, m=50, sd=10) generates 100 random deviates from a normal distribution with mean
50 and standard deviation 10.
The fitdistr() function in the MASS package provides maximum-likelihood fitting of univariate
distributions. The format is fitdistr(x, densityfunction), where x is the sample data and
densityfunction is one of the following: "beta", "cauchy", "chi-squared", "exponential", "f",
"gamma", "geometric", "log-normal", "lognormal", "logistic", "negative binomial", "normal",
"Poisson", "t" or "weibull".
Exercise 1: Children's IQ scores are normally distributed with a mean of 100 and a standard
deviation of 15. What proportion of children are expected to have an IQ between 80 and 120?
> mean=100; sd=15
> lb=80; ub=120
> x <- seq(-4,4,length=100)*sd + mean
> hx <- dnorm(x,mean,sd)
> plot(x, hx, type="n", xlab="IQ Values", ylab="",
+ main="Normal Distribution", axes=FALSE)
> i <- x >= lb & x <= ub
> lines(x, hx)
> polygon(c(lb,x[i],ub), c(0,hx[i],0), col="red")
> area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
> result <- paste("P(",lb,"< IQ <",ub,") =",signif(area, digits=3))
> mtext(result,3)
> axis(1, at=seq(40, 160, 20), pos=0)
Assume the population standard deviation σ of student height in the survey data set is 9.48
centimeters. Find the margin of error and the interval estimate at the 95% confidence level.
> height.response = na.omit(survey$Height)
> n = length(height.response)
> sigma = 9.48 # population standard deviation
> sem = sigma/sqrt(n); sem # standard error of the mean
[1] 0.6557453
> E = qnorm(.975)*sem; E # margin of error
[1] 1.285237
> xbar = mean(height.response) # sample mean
> xbar + c(-E, E)
[1] 171.0956 173.6661
Without assuming the population standard deviation of student height, find the margin of error
and the interval estimate at the 95% confidence level.
> height.response = na.omit(survey$Height)
> n = length(height.response)
> s = sd(height.response) # sample standard deviation
> SE = s/sqrt(n); SE # standard error estimate
[1] 0.6811677
> E = qt(.975, df=n-1)*SE; E
[1] 1.342878
> xbar = mean(height.response) # sample mean
> xbar + c(-E, E)
[1] 171.0380 173.7237
Sampling Size of Population Mean. The quality of a sample survey can be improved by
increasing the sample size.
Assume the population standard deviation σ of student height is 9.48 centimeters. Find the
sample size needed to achieve a 1.2-centimeter margin of error at the 95% confidence level.
> zstar = qnorm(.975)
> sigma = 9.48
> E = 1.2
> zstar^2 * sigma^2/ E^2
[1] 239.7454
Since the sample size must be an integer, we round up: at least 240 observations are needed.
> t.test(x, mu=10) # one-sample t-test
data: x
t = 0.70372, df = 49, p-value = 0.4849
alternative hypothesis: true mean is not equal to 10
95 percent confidence interval:
9.924374 10.157135
sample estimates:
mean of x
10.04075
> t.test(x, y) # two-sample (Welch) t-test
data: x and y
t = -0.42744, df = 14.721, p-value = 0.6752
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.4379188 0.9581946
sample estimates:
mean of x mean of y
-0.04322973 0.19663237
> wilcox.test(x, y, alternative="greater") # Wilcoxon rank-sum test
data: x and y
W = 35, p-value = 0.1272
alternative hypothesis: true location shift is greater than 0
Shapiro Test
To test if a sample follows a normal distribution. The null hypothesis here is that the sample
being tested is normally distributed.
> shapiro.test(normaly_disb)
data: normaly_disb
W = 0.98836, p-value = 0.535
Kolmogorov-Smirnov Test
Kolmogorov-Smirnov test is used to check whether 2 samples follow the same distribution.
> x <- rnorm(50)
> y <- runif(50)
> ks.test(x, y)
data: x and y
D = 0.64, p-value = 6.079e-10
alternative hypothesis: two-sided
By contrast, when the two samples come from the same distribution, the p-value is large and we
cannot reject the hypothesis that they follow the same distribution:
data: x and y
D = 0.12, p-value = 0.8693
alternative hypothesis: two-sided
Fisher's F-Test
F-Test is used to check if two samples have the same variance.
> var.test(x,y)
data: x and y
F = 0.90857, num df = 49, denom df = 49, p-value = 0.7385
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5155919 1.6010719
sample estimates:
ratio of variances
0.9085702
Chi-Squared Test
This test is used to examine if two categorical variables are dependent. The input data is in the
form of a table that contains the count value of the variables in the observation. We use
chisq.test() function to perform the chi-square test of independence.
> data_frame <- read.csv("https://goo.gl/j6lRXD")
> table(data_frame$treatment, data_frame$improvement)
improved not-improved
not-treated 26 29
treated 35 15
> chisq.test(data_frame$treatment, data_frame$improvement, correct=FALSE)
In the built-in data set survey, the Smoke column records the students' smoking habits, while the
Exer column records their exercise level. The allowed values in Smoke are "Heavy", "Regul"
(regularly), "Occas" (occasionally) and "Never". As for Exer, they are "Freq" (frequently), "Some"
and "None".
We can tally the students' smoking habits against the exercise level with the table() function in
R. The result is called the contingency table of the two variables.
Question: Test the hypothesis that the students' smoking habits are independent of their
exercise level at the .05 significance level.
> tbl = table(survey$Smoke, survey$Exer)
> tbl # the contingency table
Freq None Some
Heavy 7 1 3
Never 87 18 84
Occas 12 3 4
Regul 9 1 7
> chisq.test(tbl)
data: tbl
X-squared = 5.4885, df = 6, p-value = 0.4828
As the p-value of 0.4828 is greater than the .05 significance level, we do not reject the null
hypothesis that the smoking habit is independent of the exercise level.
6. LOOPING
It is possible to write your own R functions (see section 3). Also, if you are likely to use a set of
calculations more than once, you would be well advised to present the code in such a way that it
can be reused with minimal typing. Quite often, this brings you into the world of functions and
loops. In fact, looping is a major aspect and attraction of working with the system in the long run.
Recall what we have learnt:
Operators
While Loop
> i <- 1
> while (i < 6) {
+ print(i)
+ i = i+1
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
For Loop
> x <- c(2,5,3,9,8,11,6)
> count <- 0
> for (val in x) {
+ if(val %% 2 == 0) count = count+1
+ }
> print(count)
[1] 3
Break Statement
> x <- 1:5
> for (val in x) {
+ if (val == 3){
+ break
+ }
+ print(val)
+ }
[1] 1
[1] 2
Next Statement
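The next statement skips the rest of the current iteration and continues with the next one; a sketch mirroring the break example above:

```r
x <- 1:5
shown <- integer(0)
for (val in x) {
  if (val == 3) {
    next              # skip 3 and continue with 4
  }
  print(val)          # prints 1, 2, 4 and 5
  shown <- c(shown, val)
}
```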
Repeat Loop
> x <- 1
> repeat {
+ print(x)
+ x = x+1
+ if (x == 5){
+ break
+ }
+ }
[1] 1
[1] 2
[1] 3
[1] 4
> mu<-2; sigma<-2; n<-10 # population mean, population sd, and sample size
> asim<-10
> xbar<-rep(NA,asim)
> xvar<-rep(NA,asim)
> for(i in 1:asim)
+ {set.seed(i)
+ # x contains a random sample of size n of the variable X
+ x<-rnorm(n,mu,sigma)
+ xbar[i]<-mean(x)
+ xvar[i]<-var(x)
+ }
Is the sample mean an unbiased estimator for the population mean? Is the sample variance
also an unbiased estimator for the population variance? Increase the number of simulations to
100 and then finally to 1,000. What do you notice?
> mu<-2
> sigma<-2
> n<-10
> asim<-1000
> xbar<-rep(NA,asim)
> xbar2<-rep(NA,asim)
> xbar3<-rep(NA,asim)
> xvar<-rep(NA,asim)
> xvar2<-rep(NA,asim)
> xvar3<-rep(NA,asim)
> for(i in 1:asim) {
+ set.seed(i)
+ x<-rnorm(n,mu,sigma)
+ x2<-rnorm(n*10,mu,sigma)
+ x3<-rnorm(n*100,mu,sigma)
+ xbar[i]<-mean(x)
+ xbar2[i]<-mean(x2)
+ xbar3[i]<-mean(x3)
+ xvar[i]<-var(x)
+ xvar2[i]<-var(x2)
+ xvar3[i]<-var(x3)
+ }
> mean(xbar)
[1] 2.003151
> mean(xbar2)
[1] 1.999852
> mean(xbar3)
[1] 2.002405
> sd(xbar)
[1] 0.6463367
> sd(xbar2)
[1] 0.1908427
> sd(xbar3)
[1] 0.06513537
All three averages of the simulated sample means are close to the true mean μ = 2, illustrating
that the sample mean is unbiased, while its standard deviation shrinks by roughly a factor of
√10 each time the sample size is multiplied by 10 (0.646, 0.191, 0.065), in line with σ/√n.
[Figure: histogram of the 1,000 simulated sample means (Frequency) and the corresponding normal
Q-Q plot (Sample Quantiles).]
Now, let's assume that the variable X is no longer normally distributed but follows a chi-squared
distribution with 4 degrees of freedom.
> n<-10
> asim<-1000
> xbar<-rep(NA,asim)
> xbar2<-rep(NA,asim)
> xbar3<-rep(NA,asim)
> xvar<-rep(NA,asim)
> xvar2<-rep(NA,asim)
> xvar3<-rep(NA,asim)
> for(i in 1:asim) {
+ set.seed(i)
+ x<-rchisq(n,4)
+ x2<-rchisq(n*10,4)
+ x3<-rchisq(n*100,4)
+ xbar[i]<-mean(x)
+ xbar2[i]<-mean(x2)
+ xbar3[i]<-mean(x3)
+ xvar[i]<-var(x)
+ xvar2[i]<-var(x2)
+ xvar3[i]<-var(x3)
+ }