07 Introduction To R
CENTRAL LUZON STATE UNIVERSITY
DEPARTMENT of
STATISTICS
Learning Outcomes
After completing this chapter, the students must be able to
• Navigate the R system
• Enter and import data
• Manipulate datasets
R Fundamentals
What is R?
• R is a statistical analysis and graphics environment and also a programming
language.
• It is command-driven and very similar to the commercially produced
S-Plus® software.
• R is known for its professional-looking graphics, which allow complete
customization.
• R is open-source software and free to install under the GNU General Public
License.
• It is written and maintained by a group of volunteers known as the R Core
Team.
Brief History of R
• R was created by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, in 1993.
• R is named partly after the first names of its first two authors.
• The CRAN package repository features 10,076 available packages:
https://cran.r-project.org/
The R interface
• R has a main window (RGui) with a sub-window (R Console)
RStudio
• RStudio is an integrated development environment (IDE) for R. It includes a
console, syntax-highlighting editor that supports direct code execution, as
well as tools for plotting, history, debugging and workspace management.
• RStudio is available in open source and commercial editions and runs on the
desktop (Windows, Mac, and Linux) or in a browser connected to RStudio
Server or RStudio Workbench (Debian/Ubuntu, Red Hat/CentOS, and SUSE
Linux).
Downloading and Installing RStudio (on Windows)
Step 1: With R-base installed, let’s move on to installing RStudio. To begin,
go to https://www.rstudio.com/products/rstudio/ and click on the download
button for RStudio Desktop.
The RStudio interface
Objects
• An object is some data that has been given a name and stored in the
memory.
• The data could be anything from a single number to a whole table of data
of mixed types.
• You can create objects with the assignment operator, which looks like
this: <-
• Example: to create an object named height that holds the value 72.95
(a person’s height in inches), use the command
> height <- 72.95
• When creating new objects, you must choose an object name that
  o consists only of upper- and lowercase letters, numbers, underscores (_)
    and dots (.)
  o begins with an upper- or lowercase letter, or with a dot (.) that is not
    followed by a number
  o is not one of R’s reserved words (enter help(reserved) to see a list of
    these)
• R is case-sensitive, so height, HEIGHT, and Height are all distinct object
names.
• If you choose an object name that is already in use, you will overwrite the
old object with the new one. R does not give any warning if you do this.
• To view the contents of an object you have already created, enter the
object name
> height
• Once you have created an object, you can use it in place of the information
it contains.
> log(height)
• As well as creating objects with specific values, you can save the output of
a function or calculation directly to a new object. For example, the
following command converts the value of the height object from inches to
centimeters and saves the output to a new object called heightcm:
> heightcm <- round(height*2.54)
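The steps above can be put together in a short script. This is only a minimal sketch that reuses the height example; the name log_height is illustrative and does not come from the original slides.

```r
# Create a numeric object holding a height in inches
height <- 72.95

# Use the object in place of the value it contains
log_height <- log(height)

# Save the result of a calculation to a new object
heightcm <- round(height * 2.54)   # 72.95 in is about 185 cm
print(heightcm)

# Reusing a name overwrites the old value, and R gives no warning
height <- 70
print(height)
```

Running the lines one at a time in the console and typing an object's name after each step is a simple way to see what has changed.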
• Objects like these are called numeric objects because they contain
numbers.
• You can also create other types of objects such as character objects, which
contain a combination of any keyboard characters known as a character
string. When creating a character object, enclose the character string in
quotation marks:
> string1 <- "Hello!"
Vector
• A vector is an object that holds several data values of the same type
arranged in a particular order.
• You can create vectors with a special function which is named c.
Example:
Suppose that you have measured the temperature in degrees centigrade at
five randomly selected locations and recorded the data as: 3, 3.76, -0.35, 1.2,
-5. To save the data to a vector named temperatures, use the command:
> temperatures <- c(3, 3.76, -0.35, 1.2, -5)
• You can view the contents of a vector by entering its name, as you would
for any other object.
> temperatures
[1] 3.00 3.76 -0.35 1.20 -5.00
• The number of values a vector holds is called its length. You can check the
length of a vector with the length function:
> length(temperatures)
[1] 5
• Each data value in the vector has a position within the vector, which you
can refer to using square brackets. This is known as bracket notation. For
example, you can view the third member of temperatures with the
command:
> temperatures[3]
[1] -0.35
• If you give a vector as input to a function intended for use with a single
number (such as the exp function), R applies the function to each member
of the vector individually and gives another vector as output:
> exp(temperatures)
[1] 20.085536923 42.948425979  0.704688090  3.320116923  0.006737947
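The vector operations described so far can be tried together. The sketch below reuses the temperatures example; the conversion to degrees Fahrenheit and the name fahrenheit are added here for illustration only.

```r
temperatures <- c(3, 3.76, -0.35, 1.2, -5)

length(temperatures)      # number of values in the vector
temperatures[3]           # third member
temperatures[c(1, 5)]     # first and fifth members
temperatures[-1]          # every member except the first

# Arithmetic is applied to each member in turn:
# convert the temperatures to degrees Fahrenheit
fahrenheit <- temperatures * 9 / 5 + 32
print(fahrenheit)
```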
• If you have a large vector (such that when displayed, the values of the
vector fill several lines of the console window), the indices at the side tell
you which member of the vector each line begins with. For example, the
vector below contains twenty-seven values. The indices at the side show
that the second line begins with the eleventh member and the third line
begins with the twenty-first member. This helps you to determine the
position of each value within the vector.
[1] 0.077 0.489 1.603 2.110 2.625 1.019 1.100 1.729 2.469 -0.125
[11] 1.931 0.155 0.572 1.160 -1.405 2.868 0.632 -1.714 2.615 0.714
[21] 0.979 1.768 1.429 -0.119 0.459 1.083 -0.270
Data Frames
• A data frame is a type of object that is suitable for holding a dataset.
• A data frame is composed of several vectors of the same length, displayed
vertically and arranged side by side.
• This forms a rectangular grid in which each column has a name and
contains one vector.
• Although all of the values in one column of a data frame must be of the
same type, different columns can hold different types of data (such as
numbers or character strings).
• Data frames are ideal for storing datasets, with each column holding a
variable and each row an observation.
• To view a dataset in tabular form, use the View function:
> View(student_profile)
• You can select a whole row by leaving the column number blank. For
example, to select the sixth row of the student_profile dataset:
> student_profile[6,]
student sex residence age height weight gpa allowance
6 6 F D 19 161 60 2.11 1000
• Similarly, to select a whole column, leave the row number blank. For
example, to select the second column of the student_profile dataset:
> student_profile[,2]
[1] F M F F M F F F M M
Levels: F M
• When selecting whole columns, you can also leave out the comma
entirely and just give the column number.
> student_profile[2]
   sex
1    F
2    M
3    F
4    F
5    M
6    F
7    F
8    F
9    M
10   M
• Notice that the command student_profile[2] produces a data frame with
one column, while the command student_profile[,2] produces a vector.
• You can use the minus sign to exclude a part of the data frame instead of
selecting it. For example, to exclude the first column:
> student_profile[-1]
• You can use the colon (:) to select a range of rows or columns. For
example, to select row numbers six to ten:
> student_profile[6:10,]
student sex residence age height weight gpa allowance
6 6 F D 19 161.00 60 2.11 1000
7 7 F D 20 163.00 54 2.65 750
8 8 F CO 19 144.78 50 2.01 500
9 9 M CO 20 168.00 51 2.09 1000
10 10 M CO 20 167.64 48 2.21 1000
[1] 19 19 20
• You can refer to specific entries using a combination of the variable name
and bracket notation. For example, to select the tenth observation for the
height variable:
> student_profile$height[10]
[1] 167.64
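The selection rules above can be tried on a small data frame. The sketch below assumes a stand-in data frame called students (a hypothetical name), built from rows 6 to 10 of the student_profile output shown earlier, so its rows are numbered 1 to 5 and only some of the columns are included.

```r
# Stand-in for student_profile: five rows, five of the original columns
students <- data.frame(
  student   = 6:10,
  sex       = c("F", "F", "F", "M", "M"),
  residence = c("D", "D", "CO", "CO", "CO"),
  age       = c(19, 20, 19, 20, 20),
  height    = c(161, 163, 144.78, 168, 167.64)
)

students[2, ]        # one whole row
students[, 2]        # one column, returned as a vector
students[2]          # one column, returned as a data frame
students[-1]         # every column except the first
students[2:4, ]      # a range of rows
students$height[5]   # one entry, by variable name and position
```

Checking is.data.frame(students[2]) and is.data.frame(students[, 2]) is one way to confirm the difference between the two column selections.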
Workspaces
The workspace is the virtual area containing all of the objects you have
created in the session.
• To see a list of all of the objects in the workspace, use the objects function:
> objects()
• You can delete objects from the workspace with the rm function:
> rm(height, string1, string2)
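A short session shows how the workspace changes as objects are created and removed. This is a sketch only; the object names demo_height and demo_label are made up for the example.

```r
# Create two objects
demo_height <- 72.95
demo_label  <- "Hello!"

print(objects())     # lists the objects currently in the workspace

rm(demo_height)      # delete one object

# Check whether the deleted object can still be found
still_there <- exists("demo_height", inherits = FALSE)
print(still_there)

# To clear the whole workspace you could use: rm(list = objects())
```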
Error Messages
• Sometimes R will encounter a problem while trying to complete one of your
commands. When this happens, a message is displayed in the console
window to inform you of the problem.
• These messages come in two varieties, known as error messages and
warning messages.
• Error messages begin with the text Error: and are displayed when R is not
able to perform the command at all.
• Warning messages begin with the text Warning: and tell you about issues
that have not prevented the command from being completed, but that you
should be aware of.
• For example, the command below calculates the natural logarithm of each
of the values in the temperatures vector. However, the logarithm cannot be
calculated for all of the values, as some of them are negative:
> log(temperatures)
[1] 1.0986123 1.3244190 NaN 0.1823216 NaN
Warning message:
In log(temperatures) : NaNs produced
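The difference between the two kinds of message can be checked in code. The sketch below uses tryCatch to report whether a command finished cleanly, with a warning, or with an error; check_outcome is a hypothetical helper written for this example, not a built-in R function.

```r
temperatures <- c(3, 3.76, -0.35, 1.2, -5)

# Report how an expression finishes: "ok", "warning" or "error"
check_outcome <- function(expr) {
  tryCatch(
    { expr; "ok" },
    warning = function(w) "warning",
    error   = function(e) "error"
  )
}

check_outcome(log(10))             # completes normally
check_outcome(log(temperatures))   # warning: some values are negative
check_outcome(log("ten"))          # error: the input is not a number

# After a warning a result is still returned;
# count the values that could not be calculated
logs <- suppressWarnings(log(temperatures))
sum(is.nan(logs))
```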
Working with Data Files
Entering Data Directly
• If you have a small dataset that is not already recorded in electronic form,
you may want to input your data into R directly.
Example:
Data for the largest supermarket chains in 2011
Chain         Stores   Sales Area (1,000 sq ft)   Market Share (%)
Morrisons        439                      12261               12.3
Asda                                                          16.9
Tesco           2715                      36722               30.3
Sainsbury’s      934                      19108               16.5
• To enter a dataset into R, the first step is to create a vector of data values
for each variable using the c function
> Chain <- c("Morrisons", "Asda", "Tesco", "Sainsburys")
> Stores <- c(439, NA, 2715, 934)
> Sales.Area <- c(12261, NA, 36722, 19108)
> Market.Share <- c(12.3, 16.9, 30.3, 16.5)
• The vectors should all have the same length, meaning that they should
contain the same number of values.
• Where a data value is missing, enter the characters NA in its place.
• Remember to put quotation marks around non-numeric values, as shown
for the Chain variable.
• Once you have created vectors for each of the variables, use the
data.frame function to combine them to form a data frame:
> supermarkets <- data.frame(Chain, Stores, Sales.Area,
Market.Share)
• You can check that the dataset has been entered correctly by entering its
name:
> supermarkets
Chain Stores Sales.Area Market.Share
1 Morrisons 439 12261 12.3
2 Asda NA NA 16.9
3 Tesco 2715 36722 30.3
4 Sainsburys 934 19108 16.5
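Once the data frame exists, a few quick checks help confirm that it was entered as intended. The sketch below rebuilds the supermarkets example and adds some checks that are not part of the original slides (str, colSums with is.na, and the na.rm argument of mean).

```r
Chain        <- c("Morrisons", "Asda", "Tesco", "Sainsburys")
Stores       <- c(439, NA, 2715, 934)
Sales.Area   <- c(12261, NA, 36722, 19108)
Market.Share <- c(12.3, 16.9, 30.3, 16.5)

supermarkets <- data.frame(Chain, Stores, Sales.Area, Market.Share)

str(supermarkets)               # variable names, classes and first values
colSums(is.na(supermarkets))    # number of missing values in each variable

# Many summary functions return NA unless missing values are removed
mean(supermarkets$Stores)                  # NA
mean(supermarkets$Stores, na.rm = TRUE)    # mean of the three known values
```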
• When you set the header argument to F, R assigns generic variable names
of V1, V2, and so on. Alternatively, you can supply your own names with
the col.names argument:
> dataset1 <- read.csv("C:/folder/filename.csv", header=F,
col.names=c("Name1", "Name2", "Name3"))
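The effect of the header and col.names arguments can be seen with a small file. The sketch below writes a two-line CSV file without a header row to a temporary location (so no real folder is assumed) and reads it back both ways; the file contents and the names generic and named are illustrative.

```r
# Write a small CSV file with no header row to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Morrisons,439,12.3", "Tesco,2715,30.3"), csv_path)

# Without names, R falls back to V1, V2, and so on
generic <- read.csv(csv_path, header = FALSE)
names(generic)

# Supplying your own names with col.names
named <- read.csv(csv_path, header = FALSE,
                  col.names = c("Chain", "Stores", "Market.Share"))
print(named)
```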
Tab-Delimited Files
• The tab-delimited file format is very similar to the CSV format except that
the data values are separated with horizontal tabs instead of commas.
Example: The supermarkets dataset saved in the tab-delimited file format
Exporting Datasets
• To export a data frame named dataset to a CSV file, use the write.csv
function:
> write.csv(dataset, "filename.csv")
• To save the file somewhere other than in the working directory, enter the
full path for the file:
> write.csv(dataset, "C:/folder/filename.csv")
• The write.table function allows you to export data to a wider range of file
formats, including tab-delimited files. Use the sep argument to specify
which character should be used to separate the values. To export a dataset
to a tab-delimited file, set the sep argument to "\t" (which denotes the tab
symbol):
> write.table(dataset, "filename.txt", sep="\t")
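A round trip is a simple way to check an export: write the file, read it back, and compare. The sketch below assumes a small illustrative data frame and writes to R's temporary directory rather than a real folder. It also sets row.names = FALSE, an option not used in the commands above, so that the row numbers are not written as an extra column.

```r
dataset <- data.frame(
  Chain        = c("Morrisons", "Asda", "Tesco"),
  Stores       = c(439, NA, 2715),
  Market.Share = c(12.3, 16.9, 30.3)
)

csv_path <- file.path(tempdir(), "supermarkets.csv")
txt_path <- file.path(tempdir(), "supermarkets.txt")

# Export to CSV and to a tab-delimited file
write.csv(dataset, csv_path, row.names = FALSE)
write.table(dataset, txt_path, sep = "\t", row.names = FALSE)

# Read both files back in
from_csv <- read.csv(csv_path)
from_txt <- read.delim(txt_path)   # read.delim expects tab-separated values

print(from_csv)
print(from_txt)
```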
Exercise:
This dataset gives the eye color (brown, blue, or green), height in
centimeters, hand span in millimeters, sex (1 for male, 2 for female), and
handedness (L for left-handed, R for right-handed) of sixteen people.
Encode the following dataset in R and create a data frame named people.
Subject   Eye Color   Height   Hand Span   Sex   Handedness
1         Brown       186      210         1     R
2         Green       182      220         1     R
3         Brown       147      167         2
4         Green       157      180         2     L
5         Brown       170      193         1     R
6         Blue        169      190         2     L
7         brown       174      217         1     R
8         Blue        173      211         1     R
9         Blue        166      193         2     R
10        Blue        166      178         2     R
11        Brown       163      223         1     R
12        Blue        184      225         1     R
13        Blue        176      214         1
14        Blue        183      218         1     R
15        Green       160      190         2
16        Brown       173      196         1     R
Preparing and Manipulating Your Data
Variable
• You can rearrange or remove the variables in a dataset with the subset
function. Use the select argument to choose which variables to keep and in
which order. Remove unwanted variables by excluding them from the list.
• For example, this command removes the Subject, Height and Handedness
variables from the people dataset, and rearranges the remaining variables
so that Hand.Span is first, followed by Sex then Eye.Color:
> people1 <- subset(people, select=c(Hand.Span, Sex, Eye.Color))
• Notice that the command creates a new dataset called people1, which is a
modified version of the original, and leaves the original dataset unchanged.
Alternatively, you can overwrite the original dataset with this:
> people <- subset(people, select=c(Hand.Span, Sex, Eye.Color))
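The select argument can be tried on a cut-down version of the dataset. The sketch below assumes a data frame holding only the first six rows of the exercise table; the second command, which uses a minus sign inside select to drop variables, is an addition to the approach described above.

```r
# First six rows of the exercise dataset (a shortened stand-in)
people <- data.frame(
  Subject    = 1:6,
  Eye.Color  = c("Brown", "Green", "Brown", "Green", "Brown", "Blue"),
  Height     = c(186, 182, 147, 157, 170, 169),
  Hand.Span  = c(210, 220, 167, 180, 193, 190),
  Sex        = c(1, 1, 2, 2, 1, 2),
  Handedness = c("R", "R", NA, "L", "R", "L")
)

# Keep three variables, in a new order
people1 <- subset(people, select = c(Hand.Span, Sex, Eye.Color))
names(people1)

# A minus sign drops the listed variables and keeps the rest in order
people2 <- subset(people, select = -c(Subject, Handedness))
names(people2)

# The original dataset is unchanged
names(people)
```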
• The subset function does more than remove and rearrange variables. You
can also use it to select a subset of observations from a dataset.
Renaming Variables
• The names function displays a list of the variable names for a dataset:
> names(people)
[1] "Subject" "Eye.Color" "Height" "Hand.Span" "Sex" "Handedness"
• You can also use the names function to rename variables. This command
renames the fifth variable in the people dataset:
> names(people)[5] <- "Gender"
> names(people)
[1] "Subject" "Eye.Color" "Height" "Hand.Span" "Gender" "Handedness"
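Renaming by position depends on knowing the column order. As a sketch of an alternative, the example below also renames a variable by looking up its current name; the small data frame and the new name Height.cm are illustrative.

```r
people <- data.frame(
  Subject = 1:3,
  Height  = c(186, 182, 147),
  Sex     = c(1, 1, 2)
)

# Rename by position
names(people)[3] <- "Gender"

# Rename by current name, which still works if the column order changes
names(people)[names(people) == "Height"] <- "Height.cm"

names(people)
```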
• Make sure that you provide the same number of variable names as there
are variables in the dataset.
Variable Classes
• Each of the variables in a dataset has a class, which describes the type of
data the variable contains. You can view the class of a variable with the
class function:
> class(dataset$variable)
• To check the class of all the variables simultaneously, use the sapply
function:
> sapply(dataset, class)
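The two commands above can be tried on a small example. This sketch uses three rows of the exercise data; note that the class reported for a text column depends on the R version, since data.frame stores strings as character from R 4.0.0 onward and as factor in earlier versions by default.

```r
people <- data.frame(
  Subject   = 1:3,
  Eye.Color = c("Brown", "Green", "Brown"),
  Height    = c(186, 182, 147)
)

class(people$Height)              # class of a single variable

classes <- sapply(people, class)  # class of every variable
print(classes)

str(people)                       # classes together with the first values
```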
• A variable’s class determines how R will treat the variable when you use it in
statistical analysis and plots. There are many possible variable classes in R,
but only a few that you are likely to use:
1. Numeric variables contain real numbers, meaning positive or negative numbers with or
without a decimal point. They can also contain the missing data symbol (NA).
2. Integer variables contain positive or negative numbers without a decimal point. This
class behaves in much the same way as the numeric class. An integer variable is
automatically converted to a numeric variable if a value with a fractional part is
included.
3. Factor variables are suitable for categorical data. Factor variables generally have a
small number of unique values, known as levels. The actual values can be either
numbers or character strings.
4. Character variables contain character strings. A character string is any combination of
Unicode characters, including letters, numbers, and symbols. This class is suitable for
any data that does not belong to one of the other classes, such as reference numbers,
labels, and text giving additional comments or information.
• You can change the class of a variable to factor with the as.factor function:
> dataset$variable <- as.factor(dataset$variable)
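• If you have a variable containing numeric values that for some reason has
been assigned another class, you can change it using the as.numeric
function.
> dataset$variable <- as.numeric(dataset$variable)
• If the variable is currently a factor, convert it to character first, because
as.numeric applied directly to a factor returns the internal level codes
rather than the original values:
> dataset$variable <- as.numeric(as.character(dataset$variable))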
• You can change the class of a variable to character using the as.character
function:
> dataset$variable <- as.character(dataset$variable)
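One conversion deserves care: turning a factor whose labels are numbers into a numeric variable. The sketch below uses a made-up factor called codes to show that as.numeric on a factor returns the level codes, and that converting to character first recovers the values.

```r
# A factor whose labels happen to be numbers
codes <- factor(c("10", "20", "10", "35"))

levels(codes)                      # the distinct labels

wrong <- as.numeric(codes)         # level codes, not the original values
print(wrong)

right <- as.numeric(as.character(codes))   # the original values
print(right)

# Character to factor and back is straightforward
labels_chr <- as.character(codes)
labels_fac <- as.factor(labels_chr)
class(labels_fac)
```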
• A factor variable has a number of levels, which are all of the unique values
that the variable takes (i.e., all of the possible categories). To view the
levels of a factor variable, use the levels function:
> levels(people$Sex)
[1] "1" "2"
• Because the level names will appear on any plots and statistical output that
you create based on the variable, it is helpful if they are meaningful and
attractive. You can change the names of the levels:
> levels(people$Sex) <- c("Male", "Female")
• You can also combine factor levels by renaming them. Consider the Eye.Color
variable in the people dataset. Using the levels function, you can see that
there is an extra level resulting from a spelling variation:
> levels(people$Eye.Color)
[1] "Blue" "brown" "Brown" "Green"
• To rename the second factor level so that it has the correct spelling, use the
command:
> levels(people$Eye.Color)[2]<-"Brown"
• When the factor levels are viewed again, you can see that the two levels
have been combined:
> levels(people$Eye.Color)
[1] "Blue" "Brown" "Green"
• You can change the order of the levels with the relevel function. For
example, to make Brown the first level of the Eye.Color variable, use the
command:
> people$Eye.Color <- relevel(people$Eye.Color, "Brown")
> levels(people$Eye.Color)
[1] "Brown" "Blue" "Green"
• The order of the factor levels is important, because if you include the factor
in a statistical model, R uses the first level of the factor as the reference
level.
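The factor operations above can be tried on standalone vectors. The sketch below uses the first seven eye colors and sexes from the exercise table. It merges the misspelled level by name rather than by position, which is a small departure from the command shown earlier: the position of "brown" among the sorted levels can depend on the computer's locale, so matching by name is a safer assumption.

```r
eye <- factor(c("Brown", "Green", "Brown", "Green", "Brown", "Blue", "brown"))
nlevels(eye)       # "brown" is counted as a separate level

# Merge the misspelled level into "Brown", matching it by name
levels(eye)[levels(eye) == "brown"] <- "Brown"
levels(eye)

# Make Brown the first (reference) level
eye <- relevel(eye, ref = "Brown")
levels(eye)
table(eye)         # counts for each level

# Replace numeric codes with meaningful level names
sex <- factor(c(1, 1, 2, 2, 1, 2, 1))
levels(sex) <- c("Male", "Female")
table(sex)
```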
• For example, to select all of the observations from the people dataset
where the value of the Eye.Color variable is Brown, use the command:
> subset(people, Eye.Color=="Brown")
Subject Eye.Color Height Hand.Span Sex Handedness
1 Brown 186 210 1 R
3 Brown 147 167 2 NA
5 Brown 170 193 1 R
11 Brown 163 223 1 R
16 Brown 173 196 1 R
• To save the selected observations to a new dataset, assign the output to a
new dataset name:
> browneyes <- subset(people, Eye.Color=="Brown")
• To select all the observations for which a variable takes any one of a list of
values, use the %in% operator. For example, to select all observations
where Eye.Color is either Brown or Green, use the command:
> subset(people, Eye.Color %in% c("Brown", "Green"))
Subject Eye.Color Height Hand.Span Sex Handedness
1 Brown 186 210 1 R
2 Green 182 220 1 R
3 Brown 147 167 2 NA
4 Green 157 180 2 L
5 Brown 170 193 1 R
11 Brown 163 223 1 R
15 Green 160 190 2 NA
16 Brown 173 196 1 R
• Notice that quotation marks are not required for numeric values.
• With numeric variables, you can also use relational operators to select
observations. For example, to select all observations for which the value of
the Height variable is less than 165, use the command:
> subset(people, Height<165)
Subject Eye.Color Height Hand.Span Sex Handedness
3 Brown 147 167 2 NA
4 Green 157 180 2 L
11 Brown 163 223 1 R
15 Green 160 190 2 NA
• Other relational operators you could use are > (greater than), >= (greater
than or equal to) and <= (less than or equal to).
• You can combine two or more conditions using the AND operator (denoted
&) and the OR operator (denoted |). When two criteria are joined with the
AND operator, R selects only those observations that meet both conditions.
When they are joined with the OR operator, R selects the observations that
meet either one of the conditions, or both.
• For example, to select observations where Eye.Color is Brown and Height is
less than 165, use the command:
> subset(people, Eye.Color=="Brown" & Height<165)
Subject Eye.Color Height Hand.Span Sex Handedness
3 Brown 147 167 2 NA
11 Brown 163 223 1 R
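The different kinds of condition can be compared side by side. The sketch below assumes a shortened people data frame containing only the first six rows of the exercise table, so the rows returned differ from the full-dataset output shown above. The != (not equal to) operator is included as an extra.

```r
# First six rows of the exercise dataset (a shortened stand-in)
people <- data.frame(
  Subject   = 1:6,
  Eye.Color = c("Brown", "Green", "Brown", "Green", "Brown", "Blue"),
  Height    = c(186, 182, 147, 157, 170, 169)
)

brown      <- subset(people, Eye.Color == "Brown")                 # one value
brown_grn  <- subset(people, Eye.Color %in% c("Brown", "Green"))   # any of a list
short      <- subset(people, Height < 165)                         # relational
both       <- subset(people, Eye.Color == "Brown" & Height < 165)  # AND
either     <- subset(people, Eye.Color == "Blue" | Height > 180)   # OR
not_brown  <- subset(people, Eye.Color != "Brown")                 # not equal

print(both)
print(either)
```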
• As well as selecting a subset of observations from the dataset, you can also
use the select argument to select which variables to keep.
> subset(people, Height<165, select=c(Hand.Span, Height))
Hand.Span Height
167 147
180 157
223 163
190 160
• Another way to subset a dataset is using bracket notation. For example, this
command selects only those people with brown eyes:
> people[people$Eye.Color=="Brown",]
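The subset function and bracket notation usually give the same result, but they treat missing values differently. The sketch below again assumes the first six rows of the exercise table, where the handedness of subject 3 is missing; the use of which to drop the unwanted row is an addition to the text above.

```r
people <- data.frame(
  Subject    = 1:6,
  Height     = c(186, 182, 147, 157, 170, 169),
  Hand.Span  = c(210, 220, 167, 180, 193, 190),
  Handedness = c("R", "R", NA, "L", "R", "L")
)

# Rows and columns together: two equivalent forms
with_subset   <- subset(people, Height < 165, select = c(Hand.Span, Height))
with_brackets <- people[people$Height < 165, c("Hand.Span", "Height")]

# When the condition involves a missing value, subset drops that row ...
left_subset <- subset(people, Handedness == "L")

# ... but bracket notation returns an extra row filled with NA
left_brackets <- people[people$Handedness == "L", ]

# Wrapping the condition in which() removes the extra row
left_which <- people[which(people$Handedness == "L"), ]

print(left_brackets)
```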
Sorting a Dataset
• You can use the order function to sort a dataset. For example, to sort
the people dataset by the Hand.Span variable, use the command:
> people <- people[order(people$Hand.Span),]
• You can also sort by more than one variable. To sort the dataset first
by Sex and then by Height, use the command:
> people <- people[order(people$Sex, people$Height),]
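The order function returns the row positions in sorted order, which is why it is used inside the brackets. The sketch below assumes the first six rows of the exercise table and saves each result under a new name instead of overwriting people; the decreasing argument is an extra not mentioned above.

```r
people <- data.frame(
  Subject   = 1:6,
  Height    = c(186, 182, 147, 157, 170, 169),
  Hand.Span = c(210, 220, 167, 180, 193, 190),
  Sex       = c(1, 1, 2, 2, 1, 2)
)

order(people$Hand.Span)    # row positions, smallest hand span first

by_span       <- people[order(people$Hand.Span), ]
tallest_first <- people[order(people$Height, decreasing = TRUE), ]
by_sex_height <- people[order(people$Sex, people$Height), ]

print(by_sex_height)
```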
Combining and Restructuring Datasets
Appending Rows
• The rbind function allows you to attach one dataset on to the bottom of
the other, which is known as appending or concatenating the datasets.
Example: Combine the two datasets CIAdata1 and CIAdata2
CIAdata1
  Country   LifeExp  urban  pcGDP
1 Finland     79.41     85  36700
2 Slovakia    76.03     55  23600
3 UK          80.17     80  36600
4 Ukrain      68.17     69   7300
5 Spain       81.27     77  31000
CIAdata2
  Country   pcGDP  LifeExp  urban
1 Italy     30900    81.86     68
2 Croatia   18400    75.99     58
3 Slovakia  23600    76.03     55
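A sketch of how the two datasets could be appended is given below, with the values typed in from the tables above (country names are spelled as they appear there). For data frames, rbind matches columns by name, so the different column order in CIAdata2 is not a problem. The final step, which uses unique to drop the repeated Slovakia row, is an optional extra; the names CIAdata and CIAdata_unique are illustrative.

```r
CIAdata1 <- data.frame(
  Country = c("Finland", "Slovakia", "UK", "Ukrain", "Spain"),
  LifeExp = c(79.41, 76.03, 80.17, 68.17, 81.27),
  urban   = c(85, 55, 80, 69, 77),
  pcGDP   = c(36700, 23600, 36600, 7300, 31000)
)
CIAdata2 <- data.frame(
  Country = c("Italy", "Croatia", "Slovakia"),
  pcGDP   = c(30900, 18400, 23600),
  LifeExp = c(81.86, 75.99, 76.03),
  urban   = c(68, 58, 55)
)

# Attach CIAdata2 to the bottom of CIAdata1
CIAdata <- rbind(CIAdata1, CIAdata2)
print(CIAdata)

# Slovakia now appears twice; unique() removes exact duplicate rows
CIAdata_unique <- unique(CIAdata)
nrow(CIAdata_unique)
```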
Appending Columns
• The cbind function pastes one dataset on to the side of another.
• This is useful if the data from corresponding rows of each dataset belong to
the same observation, as is the case for the CIAdata1 and WHOdata
datasets:
CIAdata1
  Country   LifeExp  urban  pcGDP
1 Finland     79.41     85  36700
2 Slovakia    76.03     55  23600
3 UK          80.17     80  36600
4 Ukrain      68.17     69   7300
5 Spain       81.27     77  31000
WHOdata
  alcohol  mortality
1   12.52         91
2   13.33        130
3   13.37         77
4   15.60        274
5   11.62         68
• You can only use the cbind function to combine datasets that have the
same number of rows.
• This command combines the CIAdata1 and WHOdata datasets to create a
new dataset called CIAWHOdata:
> CIAWHOdata <- cbind(CIAdata1, WHOdata)
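The cbind step can be reproduced with the values from the two tables above. The sketch below adds a row-count check before combining, which is a suggested precaution rather than part of the original command; cbind does not check that the rows refer to the same observations, so the row order must already match.

```r
CIAdata1 <- data.frame(
  Country = c("Finland", "Slovakia", "UK", "Ukrain", "Spain"),
  LifeExp = c(79.41, 76.03, 80.17, 68.17, 81.27),
  urban   = c(85, 55, 80, 69, 77),
  pcGDP   = c(36700, 23600, 36600, 7300, 31000)
)
WHOdata <- data.frame(
  alcohol   = c(12.52, 13.33, 13.37, 15.6, 11.62),
  mortality = c(91, 130, 77, 274, 68)
)

# Both datasets must have the same number of rows
same_rows <- nrow(CIAdata1) == nrow(WHOdata)
print(same_rows)

# Paste WHOdata on to the side of CIAdata1
CIAWHOdata <- cbind(CIAdata1, WHOdata)
dim(CIAWHOdata)      # rows and columns of the combined dataset
print(CIAWHOdata)
```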
• The following command shows how you would combine the CIAdata1 and
CPIdata datasets:
> CIACPIdata <- merge(CIAdata1, CPIdata)
country LifeExp urban pcGDP CPI
1 Finland 79.41 85 36700 99.69
2 Spain 81.27 77 31300 80.24
3 Spain 81.27 77 31300 80.24
4 UK 80.17 80 36600 100.13
5 Ukrain 68.74 69 7300 51.10
• The merge function identifies variables with the same name and uses them
to match up the observations.
• When you combine two datasets with the merge function, R automatically
excludes any unmatched observations that appear in only one of the datasets.
• The all, all.x, and all.y arguments allow you to control how R deals with any
unmatched observations. To keep all unmatched observations, set the all
argument to T:
> allCIACPIdata <- merge(CIAdata1, CPIdata, all=T)
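The effect of the all, all.x, and all.y arguments is easiest to see on two very small datasets. The sketch below uses hypothetical data frames cia and cpi, with values taken from the tables and output above; they are cut-down stand-ins, not the full CIAdata1 and CPIdata datasets. The by argument, which names the matching variable explicitly, is an optional extra.

```r
cia <- data.frame(
  Country = c("Finland", "Slovakia", "UK"),
  LifeExp = c(79.41, 76.03, 80.17)
)
cpi <- data.frame(
  Country = c("Finland", "Spain", "UK"),
  CPI     = c(99.69, 80.24, 100.13)
)

# Default: only countries found in both datasets are kept
matched <- merge(cia, cpi)

# all = TRUE keeps unmatched rows from both datasets, filling gaps with NA
everything <- merge(cia, cpi, all = TRUE)

# all.x = TRUE keeps every row of the first dataset only
keep_first <- merge(cia, cpi, all.x = TRUE)

# Naming the matching variable explicitly gives the same result as the default
explicit <- merge(cia, cpi, by = "Country")

print(everything)
```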