Introduction To R 2022
HSTS204/HASC216/HASTS202
STATISTICAL COMPUTING
Introduction to R
R is a system for statistical computation and graphics. R provides, among other things, a
programming language, high level graphics, interfaces to other languages and debugging
facilities.
The R language is a dialect of S which was designed in the 1980s and has been in widespread
use in the statistical community since. The language syntax has a superficial similarity with C.
R is open source software and its home page is http://www.R-project.org/. The base R
distribution contains functions and data to implement and illustrate most common statistical
procedures, including regression and ANOVA, classical parametric and nonparametric tests,
cluster analysis, density estimation and much more.
The system processes commands entered by the user, who types the commands at the
command prompt, or submits the commands from a file called a script to save retyping and to
separate commands from results. In a window system, users interact with R through the R
console.
Packages
An R installation contains a library of packages. Some of these packages are part of the basic
installation and others can be downloaded from Comprehensive R Archive Network (CRAN)
sites through mirror sites. We use the South African mirror site for downloads. You can create
your own packages.
A package is loaded into R using the library command, e.g. library(survival). The loaded
packages are not considered part of the workspace and if you terminate your session you need
to load them again when you start a new session.
Basics
The R commands are entered at the prompt in the R console window. The prompt character is >
and when a line is continued the prompt changes to +.
L siziba(UZ) 2022
R is case sensitive.
Assignments
In every computer language, ‘variables’ provide a means of accessing the data stored in memory,
so anything you want to use or refer to later must be given a name.
The right-to-left assignment operators are the left arrow <- and the equal sign =.
Please NOTE: When specifying a file path, use / (forward slash) or \\ (double backslash), not a single \.
Working directory :
To view the current working directory type getwd(). To change the working directory type
setwd("pathname"). Change your working directory to the HSTS204 course folder you created.
Built in data
R has many built-in data sets, and more are contained in the ISwR package.
To install this package you need to be connected to the internet; then type the following command in an R
session:
install.packages("ISwR", .libPaths()[1])
Once installed, the package is loaded with library(ISwR).
The R Graphical User Interface has a Help menu to find and display online documentation for R
objects, methods, datasets, and functions. Through the Help menu one can find several manuals
in PDF form, an HTML help page, and help search utilities. The help search utilities are
also available at the command line: help("keyword") displays help for “keyword”, and
help.search("keyword") searches the help documentation for entries matching “keyword”.
The corresponding shortcuts are ? and ?? respectively. The quotes are optional in the help
command, but are required for special characters, and are always required in the help.search command.
Example Type :
R example/Tutorial
R also provides a function example that runs all of the examples if any exist for the keyword. To
see the examples for the function mean, type example(mean).
Session management
The workspace
All variables stored in R are stored in a common workspace. To see the variables that are
defined in a workspace, type ls() (list).
It is possible to delete some of the objects using the command rm(x,y,z) (remove).
It is possible to save the workspace to a file at any time using: save.image() and it will be saved
with file extension .RData.
All the commands typed in an R session are saved upon exit in a file called .Rhistory under the
working directory. You can use a text editor to edit the .Rhistory.
R objects
R does not provide direct access to the computer’s memory but rather provides a number
of specialized data structures we will refer to as objects. The entities that R creates and
manipulates are known as objects. These objects are referred to through
symbols or variables. In R, however, the symbols are themselves objects and can be
manipulated in the same way as any other object.
During an R session, objects are created and stored by name. The R command
> objects()
(alternatively, ls()) can be used to display the names of (most of) the objects which are currently
stored within R. The collection of objects currently stored is called the workspace.
1) Vectors
Vectors can be thought of as contiguous cells containing data. Cells are accessed through
indexing operations: for example, x[5] is the 5th element of the vector x.
R has six basic (‘atomic’) vector types: logical, integer, real, complex, string (or character)
and raw.
Single numbers, such as 4.2, and strings, such as "four point two" are still vectors, of length
1; there are no more basic types. Vectors with length zero are possible (and useful).
String vectors have mode and storage mode "character". A single element of a character
vector is often referred to as a character string.
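As a minimal sketch of the points above, the following creates vectors of several atomic types and inspects them. The object names (v_logical, v_char, x, empty) and the values are illustrative only and are not part of the notes.

```r
# Atomic vectors: every element of a vector has the same type
v_logical <- c(TRUE, FALSE, NA)
v_integer <- 1:5
v_double  <- c(4.2, 1.5)
v_char    <- c("four point two", "abc")

typeof(v_logical)   # "logical"
typeof(v_integer)   # "integer"
typeof(v_double)    # "double" (the 'real' type)
mode(v_char)        # "character"

length(4.2)         # 1: a single number is a vector of length 1

x <- c(10, 20, 30, 40, 50, 60)
x[5]                # 50: the 5th element

empty <- numeric(0) # a vector of length zero
length(empty)       # 0
```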
2) Lists
Lists (“generic vectors”) are another kind of data storage. Lists have elements, each of which
can contain any type of R object, i.e. the elements of a list do not have to be of the same type.
List elements are accessed through three different indexing operations.
Lists are vectors, and the basic vector types are referred to as atomic vectors where it is
necessary to exclude lists.
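The indexing operations usually meant here are single brackets, double brackets and the dollar sign. The sketch below illustrates them on a small list; the list person and its contents are hypothetical.

```r
# A list whose elements have different types
person <- list(name = "Tafadzwa", marks = c(65, 72, 58), passed = TRUE)

person["marks"]      # single brackets: returns a list containing the element
person[["marks"]]    # double brackets: returns the element itself (a numeric vector)
person$name          # dollar sign: returns a named element

is.list(person["marks"])     # TRUE
is.numeric(person[["marks"]]) # TRUE
is.vector(person)            # TRUE: lists are vectors
is.atomic(person)            # FALSE: but they are not atomic vectors
```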
3) Language objects
There are three types of objects that constitute the R language. They are calls, expressions,
and names. These objects have modes "call", "expression", and "name", respectively.
They can be created directly from expressions using the quote mechanism and converted to
and from lists by the as.list and as.call functions.
Symbol objects
Symbols refer to R objects. The name of any R object is usually a symbol. Symbols can be
created through the functions as.name and quote.
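A short sketch of how language objects can be created with quote and converted to and from lists. The particular call used, sum(2, 3, 4), is only an illustration.

```r
cl <- quote(sum(2, 3, 4))   # an unevaluated call
mode(cl)                    # "call"
eval(cl)                    # 9

parts <- as.list(cl)        # convert the call to a list: function name, then arguments
parts[[1]] <- as.name("prod")  # replace the function name by another symbol
cl2 <- as.call(parts)       # convert the list back to a call
cl2                         # prod(2, 3, 4)
eval(cl2)                   # 24

mode(as.name("x"))          # "name"
mode(expression(a + b))     # "expression"
```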
4) Expression objects
An expression contains one or more statements.
5) Function objects
In R functions are objects and can be manipulated in much the same way as any other object.
Functions (or more precisely, function closures) have three basic components:
-a formal argument list: the argument list is a comma-separated list of arguments;
-a body: the body is a parsed R statement, which is usually a collection of statements in braces
(‘{’ and ‘}’), but it can be a single statement, a symbol or even a constant;
-an environment: a function’s environment is the environment that was active at the time
that the function was created.
The syntax for writing a function is function ( arglist ) body
The function declaration begins with the keyword function, which indicates to R that you want to
create a function.
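The three components can be inspected with the base functions formals, body and environment. The function add_step below is a hypothetical example written for this sketch.

```r
# A small function with two formal arguments, one of them with a default value
add_step <- function(x, step = 1) {
  x + step
}

formals(add_step)         # the formal argument list
names(formals(add_step))  # "x" "step"
body(add_step)            # the body: the statements in braces
environment(add_step)     # the environment in which the function was created

add_step(2)               # 3
add_step(2, step = 10)    # 12
```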
6) NULL
There is a special object called NULL. It is used whenever there is a need to indicate or specify
that an object is absent. It should not be confused with a vector or list of zero length.
The NULL object has no type and no modifiable properties.
7) Builtin objects and special forms
These two kinds of object contain the builtin functions of R, i.e., those that are displayed as
.Primitive in code listings (as well as those accessed via the .Internal function and hence not
user-visible as objects). The difference between the two lies in the argument handling. Builtin
functions have all their arguments evaluated and passed to the internal function, in accordance
with call-by-value, whereas special functions pass the unevaluated arguments to the internal
function.
The other objects include promise objects, dot-dot-dot, pairlist objects and environments.
Environments can be thought of as consisting of two things: a frame, consisting of a set of
symbol-value pairs, and an enclosure, a pointer to an enclosing environment.
(i)Factors
Factors are used to describe items that can have a finite number of values (categorical
variables). A factor may be purely nominal or may have ordered categories.
(ii)Data frame objects
Data frames are the R structures which most closely mimic the SAS or SPSS data set, i.e. a
“cases by variables” matrix of data.
A data frame is a list of vectors, factors, and/or matrices all having the same length (number
of rows in the case of matrices). In addition, a data frame generally has a names attribute
labelling the variables.
Objects Attributes
All objects except NULL can have one or more attributes attached to them. Attributes are stored
as a pairlist where all elements are named, but should be thought of as a set of name=value
pairs.
The following are the basic attributes of an object:
Names: A names attribute, when present, labels the individual elements of a vector or list. When
an object is printed, the names attribute is used to label the elements.
Dimensions: The dim attribute is used to implement arrays. The content of the array is stored in
a vector in column-major order and the dim attribute is a vector of integers specifying the
respective extents of the array. R ensures that the length of the vector is the product of the
lengths of the dimensions. Matrices and arrays are simply vectors with the
attribute dim attached to the vector. A dimension vector is a vector of non-negative integers.
Dimnames: Arrays may name each dimension separately using the dimnames attribute, which is
a list of character vectors.
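The sketch below shows a vector becoming a matrix when a dim attribute is attached, and how column-major storage determines where each value lands. The row and column labels are made up for illustration.

```r
x <- 1:12
dim(x) <- c(3, 4)     # attach a dim attribute: x is now a 3 x 4 matrix
x                     # values fill the columns first (column-major order)
x[2, 3]               # 8: column 3 holds 7, 8, 9

# Name each dimension separately with a list of character vectors
dimnames(x) <- list(c("r1", "r2", "r3"), c("a", "b", "c", "d"))
x["r2", "c"]          # 8

attributes(x)         # shows the dim and dimnames attributes
```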
Classes: R has an elaborate class system, principally controlled via the class attribute. This
attribute is a character vector containing the list of classes that an object inherits from. This
forms the basis of the “generic methods” functionality in R.
Time series attributes: The tsp attribute is used to hold parameters of time series, start, end,
and frequency. This construction is mainly used to handle series with periodic substructure
such as monthly or quarterly data.
Execution of commands in R
When a user types a command at the prompt (or when an expression is read from a file), the
command is transformed by the parser/compiler into an internal representation and the
evaluator executes parsed R expressions and returns the value of the expression. All
expressions have a value. This is the core of the language.
Data Entry
Basics
Recall that R has objects and modes. Objects are anything that you can give a name. There are
many different classes of objects. The main classes of interest here are vector, matrix, factor, list,
and data frame. The mode of an object tells what kind of things are in it. The main modes of
interest here are logical, numeric, and character.
(i) Typing
(a) c (combine): e.g. c(2,4,6,8,10) creates a numeric vector.
(ii) Character vectors: c("Gerald", "Peter", "Alfred", "Mildred", "Tafadzwa") is a vector of text
string elements, which must be specified in quotes and are printed in quotes; it does not matter
whether single or double quotes are used.
(iii) Logical vectors: c(T,T,F,T,F); the elements can take the value TRUE or FALSE (or NA).
(b) seq (sequence): used for equidistant series of numbers, e.g. seq(2, 10) or seq(4,20,2)
(c) rep (replicate): used to generate repeated values, e.g. x<-c(5,10,15), then rep(x,4), rep(x, 1:3) or
rep(1:3, c(8,10,12))
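The calls mentioned above can be tried directly; the sketch below shows what each one returns.

```r
seq(2, 10)          # 2 3 4 5 6 7 8 9 10 (default step of 1)
seq(4, 20, 2)       # 4 6 8 10 12 14 16 18 20 (step of 2)

x <- c(5, 10, 15)
rep(x, 4)           # the whole vector repeated 4 times (12 values)
rep(x, 1:3)         # 5 10 10 15 15 15: each element repeated 1, 2 and 3 times

g <- rep(1:3, c(8, 10, 12))  # 8 ones, 10 twos and 12 threes
length(g)           # 30
```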
Reading from a text file is the most convenient way of getting data into R. Use the command read.table(path,
header=TRUE). It requires the data to be in ASCII (American Standard Code for Information
Interchange) format, which is the format created by any plain text editor such as Windows Notepad. This results
in a data frame. The first line of the data can contain a header giving the variable names.
The read.table command assumes fields are separated by whitespace. Variants of the
command are read.csv and read.csv2, which assume that fields are separated by commas and
semicolons respectively. Further variants are read.delim and read.delim2 for reading delimited
files, for which the default delimiter is the Tab character.
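A minimal, self-contained sketch of reading text files. So that it runs anywhere, it first writes a small file to a temporary location; the file contents and the names path, csv_path, d and d2 are invented for illustration. In practice path would point to your own data file.

```r
# Write a small whitespace-separated text file with a header line
path <- tempfile(fileext = ".txt")
writeLines(c("id weight height",
             "1 60 1.75",
             "2 72 1.80",
             "3 57 1.65"), path)

d <- read.table(path, header = TRUE)   # fields separated by whitespace
str(d)                                 # a data frame with 3 rows and 3 variables

# The same data as a comma-separated file, read back with read.csv
csv_path <- tempfile(fileext = ".csv")
write.csv(d, csv_path, row.names = FALSE)
d2 <- read.csv(csv_path)
```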
To import data from another statistical package, the simplest way is to ask that package to export
the data as a text file (one of the forms stated in (ii) above). Alternatively, the foreign package is
recommended for handling other formats like SPSS, SAS, STATA, Minitab etc.
Vectorised arithmetic
You can do calculations with vectors just like ordinary numbers and operations are applied
element by element.
>bmi<-weight/height^2
>bmi
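For the bmi calculation to run, weight and height must already exist as vectors of equal length. The values below are hypothetical and serve only to show the element-by-element arithmetic.

```r
# Hypothetical weights (kg) and heights (m) for three people
weight <- c(60, 72, 57)
height <- c(1.75, 1.80, 1.65)

# Each operation is applied element by element
bmi <- weight / height^2
bmi
round(bmi, 1)

# The first element is computed from the first weight and the first height
bmi[1] == 60 / 1.75^2   # TRUE
```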
Data frames
A data frame corresponds to what is commonly referred to as a data matrix or a data set. It is a
list of vectors and/or factors of the same length, which are related across, so that values in the
same position come from the same case.
e.g. >y1=c(1,2,3,4,5)
>y3=c(7,8,9,10,11)
>ydata=data.frame(y1,y2,y3)
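A complete version of this example might look as follows. The definition of y2 does not appear in these notes, so the values used for it here are an assumption made only so that the code runs.

```r
y1 <- c(1, 2, 3, 4, 5)
y2 <- c(2, 4, 6, 8, 10)     # assumed values: y2 is not defined in the notes
y3 <- c(7, 8, 9, 10, 11)

ydata <- data.frame(y1, y2, y3)
ydata            # 5 cases (rows) by 3 variables (columns)
dim(ydata)       # 5 3
ydata$y3         # a single variable, extracted with the dollar sign
```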
A factor is a vector object used to specify a discrete classification (grouping) of the components
of other vectors of the same length. R provides both ordered and unordered factors. A factor is
similarly created using the factor() function applied on a vector of numbers or characters.
Example: We want to capture the sex of the respondents in the data set in the table below.
>sex=c("Female", "Male", "Female", "Female", "Male", "Female", "Female", "Male", "Male")
>sexf = factor(sex)
>sexf
The command below can be used to get the levels of the factor directly without listing the
factor.
> levels(sexf)
Alternatively we can create the factor by specifying the vector of values, the levels and the labels.
>ben=c(2,1,1,1,2,1,1,1,1,1)
>possible.ben=c(1,2)
>labels.ben=c("Beneficiary","Non Beneficiary")
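The three vectors can then be passed to factor() through its levels and labels arguments. The sketch below is one plausible way to finish the example; the name benf for the resulting factor is an assumption.

```r
ben <- c(2, 1, 1, 1, 2, 1, 1, 1, 1, 1)
possible.ben <- c(1, 2)
labels.ben <- c("Beneficiary", "Non Beneficiary")

# Code 1 is labelled "Beneficiary" and code 2 "Non Beneficiary"
benf <- factor(ben, levels = possible.ben, labels = labels.ben)
benf
levels(benf)    # "Beneficiary" "Non Beneficiary"
table(benf)     # 8 beneficiaries and 2 non-beneficiaries
```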
(i) Using the command data.entry: allows you to edit variables in the workspace,
e.g. >data.entry(x,y,z). NB: this works only if the variables are already in the workspace.
(ii) Using the fix function: this command opens the named data frame in the data editor. Use
the command:
>fix(dataframe_name)
Alternatively you can use the Edit menu, then Data editor, after which you are prompted to select the
data frame name.
The function data() was originally intended to allow users to load datasets from packages for use in their
examples. Called with no arguments, >data() lists the example data sets available in the currently
attached packages, and >data(package="survival") lists all example datasets in the survival package.
Missing values
R allows vectors to contain a special value NA and computations and operations on it yield NA
as the result.
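A small illustration of how NA propagates through calculations, and of the na.rm argument that many summary functions provide to drop missing values. The vector x is made up for this sketch.

```r
x <- c(12, NA, 7, 9)

x + 1                   # 13 NA 8 10: the missing element stays missing
mean(x)                 # NA: the mean of a vector containing NA is NA
mean(x, na.rm = TRUE)   # mean of the three observed values

is.na(x)                # FALSE TRUE FALSE FALSE
sum(is.na(x))           # 1: the number of missing values
```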
>str(dataframe_name) # structure of the data frame: dimension, and name and type of each variable
>names(dataframe_name) # displays variable names
>dim(dataframe_name) # displays the dimension of the data frame: number of rows and number of variables
If we attach the data frame, the variables can be referenced directly by name, without the dollar
sign operator.
> attach(dataframe_name)
> variable_name
Indexing
Used for selection of data in a vector, e.g. with z<-(5:12), z[6] gives the element in position
6 of the vector z (here 10).
Indexing can also be used to select data in a data frame, e.g. d[6,5], will report the value of the
5th variable for the 6th subject in the data frame d and d[6,] will display the elements in the 6th
row of data frame d.
>d[d$v1>1,] can be used to select cases with v1>1 in the data set.
The transform function can be used to add/append transformed variables to the data set, e.g.
>transform(dataframe,newv1=log(v1))
The subset command can be used to select variables:
>n2=subset(n1, select=c(x, z)) or >n2=subset(n1, select=-y) would produce a data frame n2 with
variable y dropped.
The subset command can also be used to select the cases that satisfy a condition:
>n3=subset(n1, subset=x>5) would produce a data frame n3 with only cases where x>5.
A new variable can be added to a data frame by assignment, e.g.
>n1$newvar=c(1,2,3,4,5,5). Please note that the number of elements in the new vector has to be
equal to the number of rows in the data frame.
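The commands above can be tried on a small data frame. The data frame n1 below, with variables x, y and z and six cases, is hypothetical; it is built only so that each command has something to act on.

```r
n1 <- data.frame(x = c(2, 4, 6, 8, 10, 12),
                 y = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5),
                 z = c("a", "b", "a", "b", "a", "b"))

n1[6, 2]                  # value of the 2nd variable for the 6th case: 6.5
n1[6, ]                   # all variables for the 6th case
n1[n1$x > 5, ]            # cases with x > 5

n2 <- subset(n1, select = -y)        # drop variable y
n3 <- subset(n1, subset = x > 5)     # keep only cases where x > 5
n4 <- transform(n1, logx = log(x))   # append a transformed variable

n1$newvar <- c(1, 2, 3, 4, 5, 5)     # one value per row of n1
```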
Descriptive Statistics
The function table() allows frequency tables to be calculated from equal length factors.
The commands table, xtabs and ftable are used for tabulating numeric vectors as well as
factor variables when the data is presented in a unit-wise database (one record per unit).
The margin.table and prop.table commands are used to get marginal and relative frequencies,
respectively, for multiway tables, e.g. margin.table(a,2) would give marginal frequencies for
columns.
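A sketch of a two-way table with its marginal and relative frequencies. The vectors sex and group are invented for illustration; a is the resulting table, as in margin.table(a,2) above.

```r
sex   <- c("F", "M", "F", "F", "M", "F")
group <- c("A", "A", "B", "B", "B", "A")

a <- table(sex, group)    # two-way frequency table
a

margin.table(a, 1)        # row totals: F = 4, M = 2
margin.table(a, 2)        # column totals: A = 3, B = 3

prop.table(a)             # proportions of the grand total
prop.table(a, 1)          # proportions within each row
```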
In R matrices and arrays are represented as vectors with dimensions. An array can be
considered as a multiply subscripted collection of data entries. Matrices can be created using
different functions:
x<-1:12
dim(x)<-c(3,4)
(iii) cbind and rbind ‘glue’ vectors together columnwise or rowwise respectively.
Matrix operations :
> A %*% B # matrix product of A and B
> eigen(x) # eigenvalues and eigenvectors (x must be a square matrix)
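A brief sketch of creating matrices and applying some common matrix operations. The matrices A and B are small made-up examples.

```r
A <- matrix(c(2, 0, 0, 3), nrow = 2)   # 2 x 2 diagonal matrix
B <- matrix(1:4, nrow = 2)             # filled column by column

A %*% B            # matrix product
t(B)               # transpose
solve(A)           # inverse of a square, non-singular matrix
eigen(A)$values    # eigenvalues: 3 and 2

cbind(1:3, 4:6)    # vectors glued together as columns: 3 x 2
rbind(1:3, 4:6)    # vectors glued together as rows: 2 x 3
```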
Graphics
One of the most attractive features of R is that it gives a fine control of graphic components. You
can specify the plotting parameters like, plotting characters, line types etc. Learn more using the
help function.
Plot
e.g. plot(x,y), using any two variables of your choice, or plot(x,y, pch=3, col="red", type="o")
type: the default is "p" (points); "l" for lines, "o" for overplotted points and lines, "b" for both points
and lines
The barplot()
For a:
matrix: one bar for each column, summing successive values in colours
data.frame: error
Boxplot
e.g. >par(mfrow=c(3,2)) sets the display for 3 rows and 2 columns, filled one row at a time
while >par(mfcol=c(3,2)) fills in the entries column wise.
Eg:
>par(mfrow=c(2,2))
brk=10*1:10
v1=c(1,2,3,4,5,6,7,8,9,10)
v2=c(11,12,13,14,15,16,17,18,19,20)
v3=c(1,12,31,41,15,6,7,8,9,10)
plot(v3, v2, pch=12,col = "blue", xlab="X axis", ylab="y axis", main="The Graph")
Probability functions:
R provides functions for the density, cumulative distribution function (CDF), quantiles (percentiles), and for
generating random variates for many commonly applied distributions. For the Poisson
distribution these functions are dpois, ppois, qpois, and rpois, respectively.
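The four Poisson functions can be illustrated as follows. The mean lambda = 3 and the seed are arbitrary choices for this sketch.

```r
lambda <- 3

dpois(2, lambda)     # P(X = 2), the probability mass at 2
ppois(2, lambda)     # P(X <= 2), the CDF at 2
qpois(0.5, lambda)   # the median: smallest k with P(X <= k) >= 0.5, here 3

set.seed(1)          # makes the random numbers reproducible
rpois(5, lambda)     # five random Poisson variates

# The CDF is the running sum of the probabilities
sum(dpois(0:2, lambda))   # equals ppois(2, lambda)
```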
R programming
One of the main attractions of R is that it is possible to write your own R functions. R is a true
programming language that allows conditional execution and looping constructs.
Functions
R users interact with the software primarily through functions. The syntax of a function is
f(x, ...), where f is the name of the function, x is the name of the first argument (there can be several
arguments), and ... indicates possible additional arguments.
Functions can be defined with no arguments, also. The curly brackets enclose the body of the
function. The return value of a function is the value of the last expression evaluated.
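A minimal sketch of defining and calling functions. The functions centre and hello are hypothetical examples, not taken from the notes.

```r
# A function with one named argument and ... for additional arguments
centre <- function(x, ...) {
  m <- mean(x, ...)   # the extra arguments are passed on to mean()
  x - m               # the value of the last expression is returned
}

centre(c(1, 2, 3))                  # -1 0 1
centre(c(1, 2, NA), na.rm = TRUE)   # -0.5 0.5 NA

# A function defined with no arguments
hello <- function() {
  "hello"
}
hello()                             # "hello"
```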
Conditional Statements
if(cond) expr
If the condition is TRUE then the expression is executed.
####### IF STATEMENTS
# IF
x<-1
if(x == 1) x<-x+1
print(x)
x<-7
if(x <= 10 & 5 <= x){
print("X is between 5 and 10")
}
x<-7
if(x > 10 | 5 > x){
print("X is outside 5 and 10")
}
# Else statements
x<-5
if(x > 10 | 5 > x){
print("X is outside 5 and 10")
}else{
print("X is between 5 and 10")
}
# If/Else ladders
x <- 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else
print("Zero")
LOOPS
Loop=cycling or iterating.
A loop is a programming structure that repeats a sequence of instructions until a specific condition is
met.
Loop statements usually have four components: initialization (usually of a loop control variable),
continuation test on whether to do another iteration, an update step, and a loop body.
The control statement is a combination of conditions that directs the body of the loop to
execute until the specified condition becomes false.
A loop statement tests a condition and enters the body of the loop or exits the loop based on the
test result (true or false). Typical loops in programming languages are for, while and do-while;
R provides for, while and repeat.
1. Entry controlled loop: a condition is checked before executing the body of the loop. It is also
called a pre-checking loop.
2. Exit controlled loop: a condition is checked after executing the body of the loop. It is also
called a post-checking loop.
WHILE loop
while(cond) expr
FOR loop
for(var in sequence) expr
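(In C-like languages the general form is for (initial value; condition; increment or decrement)
{ statements; }. This form is not valid in R, where for iterates over the elements of a sequence.)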
FLOW CHART
REPEAT loop
repeat expr
A repeat loop is used to iterate over a block of code multiple times.
There is no condition check in a repeat loop to exit the loop. Therefore we must put a condition
explicitly inside the body of the loop and use the break statement to exit the loop.
FLOW CHART
(i) A next statement is useful when we want to skip the current iteration of a loop without
terminating the loop. On encountering next, R skips the rest of the loop body and starts the
next iteration. That is, next halts the processing of the current iteration and
advances the looping index.
(ii) A break statement is used inside a loop to stop the iterations and transfer control
to the first statement after the loop.
In a nested looping situation, where there is a loop inside another loop, break exits
from the innermost loop that is being evaluated.
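The behaviour of repeat, break and next can be sketched with two short loops. The variable names i, j and odd are illustrative.

```r
# repeat has no condition of its own: break is needed to leave the loop
i <- 0
repeat {
  i <- i + 1
  if (i > 5) break    # exit once i exceeds 5
}
i                     # 6

# next skips the rest of the current iteration and moves to the next one
odd <- numeric(0)
for (j in 1:10) {
  if (j %% 2 == 0) next   # skip even numbers
  odd <- c(odd, j)
}
odd                   # 1 3 5 7 9
```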
Martin examples
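vec1<-c(10,20,30,40,50,60,70) # hypothetical example vector (assumed; vec1 is not defined in these notes)
for(i in c(1,3,6,7)){
print(vec1[i])
}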
vec2<-numeric(0)
for(i in c(1,3,5,7,9)){
vec2[i]<-i
}
## WHILE loops
counter<-1
while(counter<10){
counter<-counter+1
}
print(counter)
counter<-1
while(counter %% 5 != 0){
counter<-counter+1
}
for (i in seq_along(x)){....}
In summary, in R vectorised calculations are usually much faster than applying the same function
to each element of a vector individually in a loop, so prefer vectorised code where possible.
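The comparison can be tried with a simple calculation such as squaring every element of a vector. This is only a sketch; the vector length is arbitrary and the timings will differ from machine to machine, so none are shown.

```r
x <- 1:100000

# Loop version: square each element in turn
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorised version: one expression operates on the whole vector
squares_vec <- x^2

all.equal(squares_loop, squares_vec)   # TRUE: both give the same result

# Compare the elapsed times of the two approaches
system.time(for (i in seq_along(x)) squares_loop[i] <- x[i]^2)
system.time(x^2)
```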
Apply function in R
# sapply - apply function over vector/ apply over an object and return a simplified object (an
array) if possible
# lapply - apply function over lists/ apply over an object and return list
# apply - apply function over matrices/ apply over the margins of an array (e.g. the rows or
columns of a matrix)
Examples
x<-1:10
y<-matrix(1:9,nrow = 3)
z<-list(-1,2,5,-7,5,8,10,-3)
x<-2:5
n<-3:6
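Using objects like those defined above, the three functions might be applied as in the sketch below. The particular functions passed to them (squaring, abs, sum) are illustrative choices.

```r
x <- 1:10
y <- matrix(1:9, nrow = 3)
z <- list(-1, 2, 5, -7, 5, 8, 10, -3)

sapply(x, function(v) v^2)    # a vector: 1 4 9 ... 100
lapply(z, abs)                # a list of absolute values
sapply(z, function(v) v > 0)  # simplified to a logical vector

apply(y, 1, sum)              # row sums: 12 15 18
apply(y, 2, sum)              # column sums: 6 15 24
```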
Enter the income vector in the same order as the categorising variable:
>income=c(1200,380,900,482,2400,800,680,800,450,720)
The lm function is used to fit linear models. The argument of lm is a model formula, in which
the tilde symbol (~) stands for “described by”,
e.g. model3=lm(y~x1+x2-1), where the -1 removes the intercept from the model.
The lm function produces a model object where more information about the model can be
obtained using extractor functions.
The most basic extractor function is summary(model), which gives the fitted model coefficients and
other summary statistics.
predict(model, int="c") # confidence bands (narrow bands), which reflect uncertainty about the line itself
predict(model, int="p") # prediction bands (wide bands), which include uncertainty about future observations
EXAMPLE
>y<-c(38,39,36,45,33,43,38,38,27,34,24,32,31,21,28)
>x<-c(21,26,22,28,19,34,26,29,18,25,23,29,30,16,29)
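>model=lm(y~x)
>pred.frame=data.frame(x=35:40)
>pp=predict(model, int="p", newdata=pred.frame) # prediction limits (assumed definition of pp)
>pc=predict(model, int="c", newdata=pred.frame) # confidence limits (assumed definition of pc)
>plot(x,y,xlim=range(x, pred.frame$x),ylim=range(y, pp)) # xlim widened so the bands at x=35:40 are visible
>abline(lm(y~x))
>pred.x=pred.frame$x
>matlines(pred.x,pc,lty=c(1,2,2),col="red")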
R Scripts
R commands can be placed in a file, called an R script, and can be run using source or copy paste.
Using the source function causes R to accept input from the named source, such as a file.
In the R GUI users can open a new script window through the File menu. R scripts will be saved
with extension .R.
When a script is run using the source function, auto-printing of expressions does not happen, so
print statements must be added to the script for the values of objects to be printed. The command you
type is source("filename.R"), where filename.R stands for the name of your script file.
Example
print(p)
r =sum(k *p) #mean
v =sum(x *(k - r)^2) / 199 #variance
print(r)
print(v)
f =dpois(k, r)
print(cbind(k, p, f))
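The script above uses k, p and x, whose definitions are not shown in these notes. A self-contained version might look like the sketch below, which could be saved as trialdata.R. The data are an assumption: the classic horse-kick frequency counts (200 observations) are used because they are consistent with the divisor 199, but the notes do not confirm that these are the intended values.

```r
# Assumed data: counts of 0, 1, 2, 3, 4 events observed in 200 units
k = c(0, 1, 2, 3, 4)          # number of events
x = c(109, 65, 22, 3, 1)      # observed frequency of each value of k (assumed)
p = x / sum(x)                # relative frequencies
print(p)
r = sum(k * p)                # mean
v = sum(x * (k - r)^2) / 199  # variance, with divisor n - 1 = 199
print(r)
print(v)
f = dpois(k, r)               # Poisson probabilities with mean r
print(cbind(k, p, f))         # observed proportions next to the Poisson model
```

With these assumed counts the mean and the variance are both close to 0.61, which is what a Poisson model would suggest.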
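On the R console, type the command source("trialdata.R") to run the whole script. Alternatively,
lines of a script can be run in one of the following ways: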
• Select lines and click the button ‘Run line or selection’ on the toolbar.
• Copy the lines, and then paste the lines at the command prompt.
• (For Windows users:) To execute one or more lines of the file in the R
GUI editor, select the lines and type Ctrl-R.