Introduction to R

Zeren Lucky L. Cabanayan


STAT2100 – Statistical Analysis with Software Application
1st Semester, 2024-2025

CENTRAL LUZON STATE UNIVERSITY
DEPARTMENT of
STATISTICS
Learning Outcomes
After completing this chapter, the students must be able to
• Navigate the R system
• Enter and import data
• Manipulate datasets

R Fundamentals
What is R?
• R is a statistical analysis and graphics environment and also a programming
language.
• It is command-driven and very similar to the commercially produced S-
Plus® software.
• R is known for its professional-looking graphics, which allow complete
customization.
• R is open-source software and free to install under the GNU General Public
License.
• It is written and maintained by a group of volunteers known as the R Core
Team.


Brief History of R
• Created by Ross Ihaka and Robert
Gentleman at the University of Auckland,
New Zealand in 1993.
• R is named partly after the first names of
the first two R authors
• The CRAN package repository features
10,076 available packages. https://cran.r-project.org/


Downloading and Installing R


The R software is freely available from the R website. Windows® and Mac®
users should follow the instructions below to download the installation file:
1. Go to the R project website at www.r-project.org.
2. Follow the link to CRAN (on the left-hand side).
3. You will be taken to a list of sites that host the R installation files (mirror
sites). Select a site close to your location.
4. Select your operating system. There are installation files available for the
Windows, Mac, and Linux® operating systems.
5. If downloading R for Windows, you will be asked to select from the base or
contrib distributions. Select the base distribution.
6. Follow the link to download the R installation file and save the file to a
suitable location on your machine.

The R interface
• R has a main window (RGui) with a sub-window (R Console)

• In the Console window, you will see a red cursor blinking where you type
R commands.


RStudio
• RStudio is an integrated development environment (IDE) for R. It includes a
console, syntax-highlighting editor that supports direct code execution, as
well as tools for plotting, history, debugging and workspace management.
• RStudio is available in open source and commercial editions and runs on the
desktop (Windows, Mac, and Linux) or in a browser connected to RStudio
Server or RStudio Workbench (Debian/Ubuntu, Red Hat/CentOS, and SUSE
Linux).


Downloading and Installing RStudio (on Windows)
Step 1: With base R installed, let's move on to installing RStudio. To begin, go
to https://www.rstudio.com/products/rstudio/ and click on the download button
for RStudio Desktop.


Step 2: Click on the link for the Windows version of RStudio and
save the .exe file.


Step 3: Run the .exe and follow the installation instructions.
a. Click Next on the welcome window.
b. Enter or browse to the path of the installation folder and click Next to
proceed.
c. Select the folder for the Start menu shortcut, or select "Do not
create shortcuts", and then click Next.
d. Wait for the installation process to complete.
e. Click Finish to end the installation.


The RStudio interface


The R Console and Command Prompt


• Every time you start R, some text relating to copyright and other issues
appears in the console window.
• Below all of the text that appears in the console at startup you will see the
command prompt, which is colored red and looks like this: >
• The command prompt tells you that R is ready to receive your command
• If you press Enter before a command is complete, R responds with a plus
sign: +. If you see the plus sign, it means you need to type the remainder
of the command and press Enter. Alternatively, you can press the Esc key to
cancel the command and return to the command prompt.
• You will also not see the command prompt while R is still working on a
task.

Symbols used to represent the basic arithmetic operations

Operation              Symbol
Addition               +
Subtraction            -
Multiplication         *
Division               /
Exponentiation         ^
Matrix multiplication  %*%

If a command is composed of several arithmetic operators, they are evaluated
in the usual order of precedence: first exponentiation (power), then
multiplication and division (which have equal precedence and are evaluated
from left to right), and finally addition and subtraction.
You can also add parentheses to control precedence if required.
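
The precedence rules can be checked directly in the console. The following is a minimal sketch; the numbers and object names (a, b, c1, d, m, mm) are arbitrary illustrations and are not taken from the course material.

```r
# Exponentiation is evaluated before multiplication
a <- 2 * 3^2        # 2 * 9

# Multiplication and division have equal precedence (left to right)
b <- 12 / 3 * 2     # (12 / 3) * 2

# Addition and subtraction are evaluated last
c1 <- 1 + 2 * 3     # 1 + 6

# Parentheses override the default order
d <- (1 + 2) * 3

# Matrix multiplication with %*%
m <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
mm <- m %*% m
```

Entering the name of any of these objects (for example, a) displays its value.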

Functions
• A function is a set of commands that have been given a name and together
perform a specific task producing some kind of output.
• Whenever you use a function, you will type the function name followed by
round brackets. Any input required by the function is placed between the
brackets.

Purpose            Function
Exponential        exp
Natural logarithm  log
Log base 10        log10
Square root        sqrt
Cosine             cos
Sine               sin
Tangent            tan
Arc cosine         acos
Arc sine           asin
Arc tangent        atan
Round              round
Absolute value     abs
Factorial          factorial
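
As an illustration of calling functions with their input between round brackets, here is a short sketch. The input values and the object names on the left are made up for the example.

```r
e1   <- exp(1)              # exponential of 1 (Euler's number)
ln_e <- log(e1)             # natural logarithm, undoes exp
l10  <- log10(1000)         # log base 10
root <- sqrt(16)            # square root
rnd  <- round(3.14159, 2)   # second input: number of decimal places
absv <- abs(-7.5)           # absolute value
fact <- factorial(5)        # 5 * 4 * 3 * 2 * 1
```

Note that round takes an optional second input giving the number of decimal places; when it is omitted, the value is rounded to a whole number.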

Objects
• An object is some data that has been given a name and stored in the
memory.
• The data could be anything from a single number to a whole table of data
of mixed types.
• You can create objects with the assignment operator, which looks like
this: <-
• Example: to create an object named height that holds the value 72.95
(a person’s height in inches), use the command
> height <- 72.95


• When creating new objects, you must choose an object name that
o consists only of upper- and lowercase letters, numbers, underscores (_)
and dots (.)
o begins with an upper- or lowercase letter or a dot (.)
o is not one of R’s reserved words (enter help(reserved) to see a list of
these)
• R is case-sensitive, so height, HEIGHT, and Height are all distinct object
names.
• If you choose an object name that is already in use, you will overwrite the
old object with the new one. R does not give any warning if you do this.


• To view the contents of an object you have already created, enter the
object name
> height

• Once you have created an object, you can use it in place of the information
it contains.
> log(height)

• As well as creating objects with specific values, you can save the output of
a function or calculation directly to a new object. For example, the
following command converts the value of the height object from inches to
centimeters and saves the output to a new object called heightcm:
> heightcm <- round(height*2.54)


• To change the contents of an object, simply overwrite it with a new value:


> height <- 69.45

• Objects like these are called numeric objects because they contain
numbers.
• You can also create other types of objects such as character objects, which
contain a combination of any keyboard characters known as a character
string. When creating a character object, enclose the character string in
straight quotation marks (curly quotation marks pasted from a document
cause a syntax error):
> string1 <- "Hello!"
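
The points about objects can be put together in one short sketch. The object Height (capital H) and its value are hypothetical and are included only to show case sensitivity.

```r
height <- 72.95                    # numeric object (inches)
heightcm <- round(height * 2.54)   # derived object, rounded to whole centimeters

Height <- 60                       # a different object: names are case-sensitive
height <- 69.45                    # overwrites the old value without any warning

string1 <- "Hello!"                # character object in straight quotation marks

height_class <- class(height)      # class of a numeric object
string_class <- class(string1)     # class of a character object
```

Note that heightcm keeps the value computed from the old height. Overwriting an object does not update other objects that were calculated from it.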


Vector
• A vector is an object that holds several data values of the same type
arranged in a particular order.
• You can create vectors with a special function which is named c.

Example:
Suppose that you have measured the temperature in degrees centigrade at
five randomly selected locations and recorded the data as: 3, 3.76, -0.35, 1.2,
-5. To save the data to a vector named temperatures, use the command:
> temperatures <- c(3, 3.76, -0.35, 1.2, -5)


• You can view the contents of a vector by entering its name, as you would
for any other object.
> temperatures
[1] 3.00 3.76 -0.35 1.20 -5.00

• The number of values a vector holds is called its length. You can check the
length of a vector with the length function:
> length(temperatures)
[1] 5


• Each data value in the vector has a position within the vector, which you
can refer to using square brackets. This is known as bracket notation. For
example, you can view the third member of temperatures with the
command:
> temperatures[3]
[1] -0.35

• If you give a vector as input to a function intended for use with a single
number (such as the exp function), R applies the function to each member
of the vector individually and gives another vector as output:
> exp(temperatures)
[1] 20.085536923 42.948425979  0.704688090  3.320116923  0.006737947
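
A brief sketch of the vector operations described above, using the same temperatures data. The conversion to degrees Fahrenheit and the object names n, third, first_two, and fahrenheit are additions for illustration.

```r
temperatures <- c(3, 3.76, -0.35, 1.2, -5)

n <- length(temperatures)        # number of values in the vector
third <- temperatures[3]         # a single member, selected by position
first_two <- temperatures[1:2]   # a range of members

# Arithmetic is applied to each member individually
fahrenheit <- temperatures * 9 / 5 + 32
```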


• If you have a large vector (such that when displayed, the values of the
vector fill several lines of the console window), the indices at the side tell
you which member of the vector each line begins with. For example, the
vector below contains twenty-seven values. The indices at the side show
that the second line begins with the eleventh member and the third line
begins with the twenty-first member. This helps you to determine the
position of each value within the vector.
[1] 0.077 0.489 1.603 2.110 2.625 1.019 1.100 1.729 2.469 -0.125
[11] 1.931 0.155 0.572 1.160 -1.405 2.868 0.632 -1.714 2.615 0.714
[21] 0.979 1.768 1.429 -0.119 0.459 1.083 -0.270


Data Frames
• A data frame is a type of object that is suitable for holding a dataset.
• A data frame is composed of several vectors of the same length, displayed
vertically and arranged side by side.
• This forms a rectangular grid in which each column has a name and
contains one vector.
• Although all of the values in one column of a data frame must be of the
same type, different columns can hold different types of data (such as
numbers or character strings).
• This makes a data frame ideal for storing datasets, with each column
holding a variable and each row an observation.


Example: Student profile dataset


student <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sex <- c("F", "M", "F", "F", "M", "F", "F", "F", "M", "M")
residence <- c("D", "CI", "CO", "CI", "CI", "D", "D", "CO", "CO", "CO")
age <- c(17, 21, 21, 18, 18, 19, 20, 19, 20, 20)
height <- c(175, 172, 152, 150, 151, 161, 163, 144.78, 168, 167.64)
weight <- c(54, 40, 50, 44, 51, 60, 54, 50, 51, 48)
gpa <- c(1.96, 2.24, 2.32, 2.63, 2.29, 2.11, 2.65, 2.01, 2.09, 2.21)
allowance <- c(600, 1000, 1500, 1000, 1000, 1000, 750, 500, 1000, 1000)

• To create a data frame, use the function data.frame:
> student_profile <- data.frame(student, sex, residence, age, height,
weight, gpa, allowance)

• To view the dataset in tabular form, use the function View:
> View(student_profile)

• It is important to know how to refer to the different components of a data
frame. To refer to a particular variable within a dataset by name, use the
dollar symbol ($):
> student_profile$age
[1] 17 21 21 18 18 19 20 19 20 20

• Bracket notation can be thought of as a coordinate system for the data
frame. You provide the row number and column number between square
brackets. For example, to select the value in the sixth row of the second
column of the student_profile dataset using bracket notation, use the
command:
> student_profile[6,2]
[1] F

• You can select a whole row by leaving the column number blank. For
example to select the sixth row of the student_profile dataset:
> student_profile[6,]
student sex residence age height weight gpa allowance
6 6 F D 19 161 60 2.11 1000

• Similarly, to select a whole column, leave the row number blank. For
example, to select the second column of the student_profile dataset:
> student_profile[,2]
[1] F M F F M F F F M M
Levels: F M

• When selecting whole columns, you can also leave out the comma
entirely and just give the column number.
> student_profile[2]
   sex
1    F
2    M
3    F
4    F
5    M
6    F
7    F
8    F
9    M
10   M

• Notice that the command student_profile[2] produces a data frame with
one column, while the command student_profile[,2] produces a vector.


• You can use the minus sign to exclude a part of the data frame instead of
selecting it. For example, to exclude the first column:
> student_profile[-1]

• You can use the colon (:) to select a range of rows or columns. For
example, to select row numbers six to ten:
> student_profile[6:10,]
student sex residence age height weight gpa allowance
6 6 F D 19 161.00 60 2.11 1000
7 7 F D 20 163.00 54 2.65 750
8 8 F CO 19 144.78 50 2.01 500
9 9 M CO 20 168.00 51 2.09 1000
10 10 M CO 20 167.64 48 2.21 1000


• You can also use object names in place of numbers. Remember that object
names are case-sensitive, so the name must be typed the same way each time:
> rownum <- c(6,8,10)
> colnum <- 4
> student_profile[rownum,colnum]
[1] 19 19 20

• You can refer to specific entries using a combination of the variable name
and bracket notation. For example, to select the tenth observation for the
height variable:
> student_profile$height[10]
[1] 167.64
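
The different ways of referring to parts of a data frame are summarized in the sketch below, using only the first four students and four of the variables. One assumption is made explicit: stringsAsFactors = TRUE is set so that character columns are stored as factors, which matches the "Levels:" output shown in this chapter. Since R 4.0.0 the default is FALSE, in which case the sex column would be displayed as quoted character strings instead.

```r
# A shortened version of the student profile data (first four students)
student <- c(1, 2, 3, 4)
sex     <- c("F", "M", "F", "F")
age     <- c(17, 21, 21, 18)
height  <- c(175, 172, 152, 150)

student_profile <- data.frame(student, sex, age, height,
                              stringsAsFactors = TRUE)

ages     <- student_profile$age     # a column by name (vector)
one_cell <- student_profile[3, 4]   # row 3, column 4
one_row  <- student_profile[3, ]    # a whole row (data frame)
col_vec  <- student_profile[, 2]    # a whole column as a vector (here a factor)
col_df   <- student_profile[2]      # the same column as a one-column data frame
no_first <- student_profile[-1]     # everything except the first column
```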


Workspaces
The workspace is the virtual area containing all of the objects you have
created in the session.

• To see a list of all of the objects in the workspace, use the objects function:
> objects()

• You can delete objects from the workspace with the rm function:
> rm(height, string1, string2)

• To delete all of the objects in the workspace, use the command:


> rm(list=objects())
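
The sketch below tries out these commands without touching your real workspace. As an assumption for the example, a separate environment named ws stands in for the workspace. In everyday use you would simply call objects() (or its equivalent, ls()) and rm() without the envir argument.

```r
# A separate environment stands in for the workspace so that nothing in the
# real session is deleted
ws <- new.env()
assign("height", 72.95, envir = ws)
assign("string1", "Hello!", envir = ws)

before <- ls(ws)                 # ls() is equivalent to objects()
rm("height", envir = ws)         # delete a single object
after_one <- ls(ws)
rm(list = ls(ws), envir = ws)    # delete everything that is left
after_all <- ls(ws)
```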


Error Messages
• Sometimes R will encounter a problem while trying to complete one of your
commands. When this happens, a message is displayed in the console
window to inform you of the problem.
• These messages come in two varieties, known as error messages and
warning messages.
• Error messages begin with the text Error: and are displayed when R is not
able to perform the command at all.


• One of the most common causes of error messages is giving a command that is
not a valid R command because it contains a symbol that R does not
understand, or because a symbol is missing or in the wrong place. This is
known as a syntax error. In the following example, the error is caused by an
extra closing parenthesis at the end of the command:
> round(3.141592))
Error: unexpected ')' in "round(3.141592))"

• Another common cause of errors is mistyping an object name so that you
are referring to an object that does not exist. Remember that object names
are case-sensitive:
> log(object5)
Error: object 'object5' not found


• The same applies to function names, which are also case-sensitive:


> Log(3.141592)
Error: could not find function "Log"

• A third common cause of errors is giving the wrong type of input to a
function, such as a data frame where a vector is expected, or a character
string where a number is expected:
> log("Hello!")
Error in log("Hello!") : non-numeric argument to mathematical function


• Warning messages begin with the text Warning: and tell you about issues
that have not prevented the command from being completed, but that you
should be aware of.
• For example, the command below calculates the natural logarithm of each
of the values in the temperatures vector. However, the logarithm cannot be
calculated for all of the values, as some of them are negative:
> log(temperatures)
[1] 1.0986123 1.3244190 NaN 0.1823216 NaN
Warning message:
In log(temperatures) : NaNs produced

• Although R is still able to perform the command and produce output, it
displays a warning message to draw your attention to this issue.
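
The difference between a warning and an error can also be seen in code. The following sketch goes slightly beyond the slides: it uses the base R functions suppressWarnings and tryCatch, which are not covered in this chapter, to show that a warning still returns a result while an error does not. The object names are illustrative.

```r
temperatures <- c(3, 3.76, -0.35, 1.2, -5)

# A warning does not stop the command; the result is still returned
logs <- suppressWarnings(log(temperatures))
n_nan <- sum(is.nan(logs))   # how many values could not be calculated

# tryCatch captures the message text instead of displaying it
warn_text <- tryCatch(log(temperatures),
                      warning = function(w) conditionMessage(w))
err_text  <- tryCatch(log("Hello!"),
                      error = function(e) conditionMessage(e))
```

Here warn_text and err_text hold the text of the warning and error messages shown above, and n_nan counts the negative temperatures for which no logarithm exists.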

Working with Data Files
Entering Data Directly
• If you have a small dataset that is not already recorded in electronic form,
you may want to input your data into R directly.
Example:
Data for the largest supermarket chains in 2011

Chain        Stores  Sales Area (1,000 sq ft)  Market Share %
Morrisons    439     12261                     12.3
Asda                                           16.9
Tesco        2715    36722                     30.3
Sainsbury’s  934     19108                     16.5


• To enter a dataset into R, the first step is to create a vector of data values
for each variable using the c function
> Chain <- c("Morrisons", "Asda", "Tesco", "Sainsburys")
> Stores <- c(439, NA, 2715, 934)
> Sales.Area <- c(12261, NA, 36722, 19108)
> Market.Share <- c(12.3, 16.9, 30.3, 16.5)
• The vectors should all have the same length, meaning that they should
contain the same number of values.
• Where a data value is missing, enter the characters NA in its place.
• Remember to put quotation marks around non-numeric values, as shown
for the Chain variable.


• Once you have created vectors for each of the variables, use the
data.frame function to combine them to form a data frame:
> supermarkets <- data.frame(Chain, Stores, Sales.Area,
Market.Share)

• You can check that the dataset has been entered correctly by entering its
name:
> supermarkets
       Chain Stores Sales.Area Market.Share
1  Morrisons    439      12261         12.3
2       Asda     NA         NA         16.9
3      Tesco   2715      36722         30.3
4 Sainsburys    934      19108         16.5
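
Once the data frame exists, a few quick checks help confirm that the missing values were entered as intended. This is a sketch using the supermarkets data. The functions dim, is.na, and mean with na.rm are standard base R, and the object names on the left are illustrative.

```r
Chain <- c("Morrisons", "Asda", "Tesco", "Sainsburys")
Stores <- c(439, NA, 2715, 934)
Sales.Area <- c(12261, NA, 36722, 19108)
Market.Share <- c(12.3, 16.9, 30.3, 16.5)
supermarkets <- data.frame(Chain, Stores, Sales.Area, Market.Share)

dims <- dim(supermarkets)                      # number of rows and columns
missing_stores <- is.na(supermarkets$Stores)   # TRUE where a value is missing
mean_all <- mean(supermarkets$Stores)          # NA, because one value is missing
mean_known <- mean(supermarkets$Stores, na.rm = TRUE)   # ignores the missing value
```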


Importing Plain Text Files


• The simplest way to transfer data to R is in a plain text file, sometimes
called a flat text file.
• These are files that consist of plain text with no additional formatting and
can be read by plain text editors such as Microsoft Notepad, TextEdit (for
Mac users), or gedit (for Linux users).
• There are several standard formats for storing spreadsheet data in text
files, which use symbols to indicate the layout of the data. These include:
o Comma-separated values or comma-delimited (.csv) files
o Tab-delimited (.txt) files
o Data interchange format (.dif) files


Comma-separated values (CSV) files


• Comma-separated values (CSV) files are the most popular way of storing
spreadsheet data in a plain text file.
• In a CSV file, the data values are arranged with one observation per line
and commas are used to separate the data values within each line (hence the
name).
Example: The supermarkets dataset saved in the CSV file format
You can import a CSV file with the read.csv function:
> dataset1 <- read.csv("C:/folder/filename.csv")


• When you use the read.csv or read.delim functions to import a file, R
assumes that the entries in the first line of the file are the variable names
for the dataset.
• If your file does not contain any variable names, set the header argument
to F (for false), as shown here. This prevents R from using the first line of
your data as the variable names:
> dataset1 <- read.csv("C:/folder/filename.csv", header=F)

• When you set the header argument to F, R assigns generic variable names
of V1, V2, and so on. Alternatively, you can supply your own names with
the col.names argument:
> dataset1 <- read.csv("C:/folder/filename.csv", header=F,
col.names=c("Name1", "Name2", "Name3"))
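
To see the effect of the header and col.names arguments without needing a file on disk, the sketch below first writes a small CSV file to a temporary location and then reads it back. The file contents and the column names V.Chain, V.Stores, and V.Share are made up for illustration.

```r
# Write a small CSV file to a temporary location (hypothetical contents)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Chain,Stores,Market.Share",
             "Morrisons,439,12.3",
             "Asda,,16.9",
             "Tesco,2715,30.3"), csv_path)

# The first line is used for the variable names (header = TRUE is the default)
with_header <- read.csv(csv_path)

# Treat the first line as data and supply the names explicitly
no_header <- read.csv(csv_path, header = FALSE,
                      col.names = c("V.Chain", "V.Stores", "V.Share"))

unlink(csv_path)   # delete the temporary file
```

With header = FALSE the file's first line becomes the first row of data, so no_header has one more row than with_header. The empty Stores field for Asda is read as NA.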


Tab-Delimited Files
• The tab-delimited file format is very similar to the CSV format except that
the data values are separated with horizontal tabs instead of commas.
Example: The supermarkets dataset saved in the tab-delimited file format

For importing tab-delimited files, there is a similar function called read.delim:


> dataset1 <- read.delim("C:/folder/filename.txt")


Importing Excel Files


• First open your file in Excel and
ensure that the data is arranged
correctly within the spreadsheet,
with one variable per column and
one observation per row.
• If the dataset includes variable
names, then these should be
placed in the first row of the
spreadsheet.


• To ensure a smooth file conversion, check the following:


o There are no empty cells above or to the left of the data grid
o There are no merged cells
o There is no more than one row of column headers
o There is no formatting such as bold, italic or colored text, cell borders or
background colors
o Where data values are missing, the cell is left empty
o There are no commas in large numbers (e.g., 1324157 is acceptable but 1,324,157
is not)
o If exponential (scientific) notation is used, the format is correct (e.g., 0.00312 can
be expressed as 3.12e-3 or 3.12E-3)
o There are no currency, unit, or percent symbols in numeric variables (symbols in
categorical variables or in the variable names are fine)
o The minus sign is used to indicate negative numbers (e.g., -5) and not brackets
(parentheses) or red text
o The workbook has only one worksheet

• When the data is prepared, save the spreadsheet as a CSV file by
selecting Save As from the File menu.
• If you do not have access to Excel, you can use an add-on package
such as xlsx or xlsReadWrite to import Excel files directly.


Importing Files from Other Software
• You can use an add-on package called foreign, which allows you to
directly import data from file types produced by some of the popular
statistical software packages.

Some of the Functions Available in the foreign Add-on Package
File type                         Extension   Function
Database format file              .dbf        read.dbf
Stata versions 5 to 12 data file  .dta        read.dta
Minitab portable worksheet file   .mtp        read.mtp
SPSS data file                    .sav        read.spss
SAS transfer format               .xport      read.xport
Epi Info data file                .rec        read.epiinfo
Octave text data file             .txt        read.octave
Attribute-relation file           .arff       read.arff
Systat file                       .sys, .syd  read.systat


Exporting Datasets
• To export a data frame named dataset to a CSV file, use the write.csv
function:
> write.csv(dataset, "filename.csv")

• For example, to export the student_profile dataset to a file named
student_profile.csv, use the command:
> write.csv(student_profile, "student_profile.csv")

• To save the file somewhere other than in the working directory, enter the
full path for the file:
> write.csv(dataset, "C:/folder/filename.csv")


• The write.table function allows you to export data to a wider range of file
formats, including tab-delimited files. Use the sep argument to specify
which character should be used to separate the values. To export a dataset
to a tab-delimited file, set the sep argument to "\t" (which denotes the tab
symbol):
> write.table(dataset, "filename.txt", sep="\t")

• By default, the write.csv and write.table functions create an extra column
in the file containing the observation numbers. To prevent this, set the
row.names argument to F:
> write.csv(dataset, "filename.csv", row.names=F)
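
A simple way to confirm that an export worked is to read the file back in. The sketch below does this for both a CSV file and a tab-delimited file. The two-row dataset and the temporary file paths are assumptions made for the example.

```r
dataset <- data.frame(Chain = c("Morrisons", "Tesco"),
                      Stores = c(439, 2715))

csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")

# Export without the extra column of observation numbers
write.csv(dataset, csv_path, row.names = FALSE)
write.table(dataset, txt_path, sep = "\t", row.names = FALSE)

# Import the files again to check their contents
from_csv <- read.csv(csv_path)
from_txt <- read.delim(txt_path)

unlink(c(csv_path, txt_path))   # delete the temporary files
```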

Exercise:
This dataset gives the eye color (brown, blue, or green), height in
centimeters, hand span in millimeters, sex (1 for male, 2 for female), and
handedness (L for left-handed, R for right-handed) of sixteen people.

Encode the following dataset in R and create a data frame named people.

Subject  Eye Color  Height  Hand Span  Sex  Handedness
1        Brown      186     210        1    R
2        Green      182     220        1    R
3        Brown      147     167        2
4        Green      157     180        2    L
5        Brown      170     193        1    R
6        Blue       169     190        2    L
7        brown      174     217        1    R
8        Blue       173     211        1    R
9        Blue       166     193        2    R
10       Blue       166     178        2    R
11       Brown      163     223        1    R
12       Blue       184     225        1    R
13       Blue       176     214        1
14       Blue       183     218        1    R
15       Green      160     190        2
16       Brown      173     196        1    R
Preparing and Manipulating Your Data
Variable
• You can rearrange or remove the variables in a dataset with the subset
function. Use the select argument to choose which variables to keep and in
which order. Remove unwanted variables by excluding them from the list.
• For example, this command removes the Subject, Height and Handedness
variables from the people dataset, and rearranges the remaining variables
so that Hand.Span is first, followed by Sex then Eye.Color:
> people1 <- subset(people, select=c(Hand.Span, Sex, Eye.Color))
• Notice that the command creates a new dataset called people1, which is a
modified version of the original, and leaves the original dataset unchanged.
Alternatively, you can overwrite the original dataset with this:
> people <- subset(people, select=c(Hand.Span, Sex, Eye.Color))


• The subset function does more than remove and rearrange variables. You
can also use it to select a subset of observations from a dataset.

• Another way of removing variables from a dataset is with bracket notation.
This is particularly useful if you have a dataset with a large number of
variables and you only want to remove a few. For example, to remove the
first, third, and sixth variables from the people dataset, use the command:
> people1 <- people[-c(1,3,6)]
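
The two approaches to removing variables can be compared side by side. The sketch below uses only the first three rows of the people dataset. The object names keep, drop_by_name, and drop_by_pos are illustrative, and dropping variables by name with a minus sign inside select is an extra idiom not shown in the slides.

```r
# First three rows of the exercise dataset
people <- data.frame(Subject = 1:3,
                     Eye.Color = c("Brown", "Green", "Brown"),
                     Height = c(186, 182, 147),
                     Hand.Span = c(210, 220, 167),
                     Sex = c(1, 1, 2),
                     Handedness = c("R", "R", NA))

# Keep and reorder variables by listing them
keep <- subset(people, select = c(Hand.Span, Sex, Eye.Color))

# Drop variables by name
drop_by_name <- subset(people, select = -c(Subject, Handedness))

# Drop variables by position
drop_by_pos <- people[-c(1, 3, 6)]
```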


Renaming Variables
• The names function displays a list of the variable names for a dataset:
> names(people)
[1] "Subject" "Eye.Color" "Height" "Hand.Span" "Sex" "Handedness"

• You can also use the names function to rename variables. This command
renames the fifth variable in the people dataset:
> names(people)[5] <- "Gender"
> names(people)
[1] "Subject" "Eye.Color" "Height" "Hand.Span" "Gender" "Handedness"


• Similarly, to rename the second, fourth, and fifth variables:
> names(people)[c(2,4,5)] <- c("Eyes", "Span.mm", "Gender")

• Alternatively, you can rename all of the variables in the dataset
simultaneously:
> names(people) <- c("Subject", "Eyes", "Height.cm", "Span.mm",
"Gender", "Hand")

• Make sure that you provide the same number of variable names as there
are variables in the dataset.
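
The renaming commands can be tried on a two-row version of the people dataset, as in the sketch below. Renaming by the current name rather than by position, and the stopifnot check on the number of names, are additions for illustration.

```r
people <- data.frame(Subject = 1:2, Eye.Color = c("Brown", "Green"),
                     Height = c(186, 182), Hand.Span = c(210, 220),
                     Sex = c(1, 1), Handedness = c("R", "R"))

names(people)[5] <- "Gender"                           # rename by position
names(people)[names(people) == "Eye.Color"] <- "Eyes"  # rename by current name

# Rename everything at once, checking the number of names first
new_names <- c("Subject", "Eyes", "Height.cm", "Span.mm", "Gender", "Hand")
stopifnot(length(new_names) == ncol(people))
names(people) <- new_names
```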


Variable Classes
• Each of the variables in a dataset has a class, which describes the type of
data the variable contains. You can view the class of a variable with the
class function:
> class(dataset$variable)

• To check the class of all the variables simultaneously, use the sapply
function:
> sapply(dataset, class)


• A variable’s class determines how R will treat the variable when you use it in
statistical analysis and plots. There are many possible variable classes in R,
but only a few that you are likely to use:
1. Numeric variables contain real numbers, meaning positive or negative numbers with or
without a decimal point. They can also contain the missing data symbol (NA)
2. Integer variables contain positive or negative numbers without a decimal point. This
class behaves in much the same way as the numeric class. An integer variable is
automatically converted to a numeric variable if a value with a fractional part is
included
3. Factor variables are suitable for categorical data. Factor variables generally have a
small number of unique values, known as levels. The actual values can be either
numbers or character strings
4. Character variables contain character strings. A character string is any combination of
Unicode characters including letters, numbers, and symbols. This class is suitable for
any data that does not belong to one of the other classes, such as reference numbers,
labels, and text giving additional comments or information

• You can change the class of a variable to factor with the as.factor function:
> dataset$variable <- as.factor(dataset$variable)

• If you have a variable containing numeric values that for some reason has
been assigned another class, you can change it using the as.numeric
function.
> dataset$variable <- as.numeric(dataset$variable)

• You can change the class of a variable to character using the as.character
function:
> dataset$variable <- as.character(dataset$variable)
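
The sketch below checks and converts variable classes on a small made-up dataset (the names id, score, and group are hypothetical). It also illustrates one point worth knowing: applying as.numeric directly to a factor returns the internal level codes rather than the values, so a factor that holds numbers is usually converted to character first.

```r
dataset <- data.frame(id = c("001", "002", "003"),
                      score = c("4.5", "3.0", "5.5"),   # numbers stored as text
                      group = c(1, 2, 1))

before <- sapply(dataset, class)

dataset$score <- as.numeric(dataset$score)   # character to numeric
dataset$group <- as.factor(dataset$group)    # numeric to factor

after <- sapply(dataset, class)

# A factor whose levels are numbers: convert through character first
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                  # internal level codes
values <- as.numeric(as.character(f))    # the values themselves
```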


Dividing a Continuous Variable into Categories


• Sometimes you may want to create a new categorical variable by classifying
the observations according to the value of a continuous variable.
Example:
Suppose that you want to create a new variable called Height.Cat, which
classifies the people as “Short”, “Medium”, and “Tall” according to their
height. People less than 160 cm tall are classified as Short, people between
160 cm and 180 cm tall are classified as Medium, and people greater than
180 cm tall are classified as Tall.
> people$Height.Cat <- cut(people$Height, c(140, 160, 180, 200),
c("Short", "Medium", "Tall"))
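
When using cut, it is worth checking how values that fall exactly on a boundary are classified. By default each interval includes its upper limit, so a height of exactly 160 is classified as Short and a height of exactly 180 as Medium. Setting right = FALSE makes each interval include its lower limit instead. The following sketch uses a few made-up heights to show the difference.

```r
Height <- c(186, 147, 160, 180, 173)

# Default (right = TRUE): intervals are (140,160], (160,180], (180,200]
cat_default <- cut(Height, c(140, 160, 180, 200),
                   c("Short", "Medium", "Tall"))

# right = FALSE: intervals are [140,160), [160,180), [180,200)
cat_left <- cut(Height, c(140, 160, 180, 200),
                c("Short", "Medium", "Tall"), right = FALSE)

counts <- table(cat_default)   # number of observations in each category
```

Values outside the range of the breaks (here, below 140 or above 200) are set to NA, so choose the outer breaks to cover all of the data.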


Working with Factor Variables


• As explained under “Variable classes,” factor variables are suitable for
holding categorical data.
• To change the class of a variable to factor, use the as.factor function:
> people$Sex <- as.factor(people$Sex)

• A factor variable has a number of levels, which are all of the unique values
that the variable takes (i.e., all of the possible categories). To view the
levels of a factor variable, use the levels function:
> levels(people$Sex)
[1] "1" "2"


• Because the level names will appear on any plots and statistical output that
you create based on the variable, it is helpful if they are meaningful and
attractive. You can change the names of the levels:
> levels(people$Sex) <- c("Male", "Female")
• You can also combine factor levels by renaming them. Consider the Eye.Color
variable in the people dataset. Using the levels function, you can see that
there is an extra level resulting from a spelling variation:
> levels(people$Eye.Color)
[1] "Blue" "brown" "Brown" "Green"

• To rename the second factor level so that it has the correct spelling, use the
command:
> levels(people$Eye.Color)[2]<-"Brown"


• When the factor levels are viewed again, you can see that the two levels
have been combined:
> levels(people$Eye.Color)
[1] "Blue" "Brown" "Green"
• You can change the order of the levels with the relevel function. For
example, to make Brown the first level of the Eye.Color variable, use the
command:
> people$Eye.Color <- relevel(people$Eye.Color, "Brown")
> levels(people$Eye.Color)
[1] "Brown" "Blue" "Green"
• The order of the factor levels is important, because if you include the factor
in a statistical model, R uses the first level of the factor as the reference
level.
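
The steps for combining and reordering factor levels can be followed on a short factor, as sketched below. The five eye colors are made up, and the levels are listed explicitly when the factor is created so that their starting order does not depend on how your system sorts upper- and lowercase letters.

```r
Eye.Color <- factor(c("Brown", "Green", "brown", "Blue", "Blue"),
                    levels = c("Blue", "brown", "Brown", "Green"))

levels(Eye.Color)[2] <- "Brown"            # merge the misspelled level
merged_levels <- levels(Eye.Color)

Eye.Color <- relevel(Eye.Color, "Brown")   # make Brown the reference level
new_levels <- levels(Eye.Color)

counts <- table(Eye.Color)                 # observations per level
```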

Selecting a Subset of the Data


Observations can be selected according to selection criteria based on
properties of the data, or randomly to form a random sample.

1. Selecting a Subset According to Selection Criteria


• Sometimes you may need to select a subset of a dataset containing only
those observations that match certain criteria, such as belonging to a
particular category or where the value of one of the numeric variables falls
within a given range. You can do this with the subset function. The
command takes the general form:
> subset(dataset, condition)


• For example, to select all of the observations from the people dataset
where the value of the Eye.Color variable is Brown, use the command:
> subset(people, Eye.Color=="Brown")
Subject Eye.Color Height Hand.Span Sex Handedness
1 Brown 186 210 1 R
3 Brown 147 167 2 NA
5 Brown 170 193 1 R
11 Brown 163 223 1 R
16 Brown 173 196 1 R
• To save the selected observations to a new dataset, assign the output to a
new dataset name:
> browneyes <- subset(people, Eye.Color=="Brown")


• To select all the observations for which a variable takes any one of a list of
values, use the %in% operator. For example, to select all observations
where Eye.Color is either Brown or Green, use the command:
> subset(people, Eye.Color %in% c("Brown", "Green"))
Subject Eye.Color Height Hand.Span Sex Handedness
1 Brown 186 210 1 R
2 Green 182 220 1 R
3 Brown 147 167 2 NA
4 Green 157 180 2 L
5 Brown 170 193 1 R
11 Brown 163 223 1 R
15 Green 160 190 2 NA
16 Brown 173 196 1 R

• To select observations to exclude instead of to include, replace == with !=
(which means “not equal to”). For example, to exclude all observations
where the value of Eye.Color is equal to "Blue", use the command:
> subset(people, Eye.Color!="Blue")

• Observations can also be selected according to the value of a numeric
variable. For example, to select all observations from the people dataset
where the Height variable is equal to 169, use the command:
> subset(people, Height==169)
Subject Eye.Color Height Hand.Span Sex Handedness
6 Blue 169 190 2 L

• Notice that quotation marks are not required for numeric values.


• With numeric variables, you can also use relational operators to select
observations. For example, to select all observations for which the value of
the Height variable is less than 165, use the command:
> subset(people, Height<165)
Subject Eye.Color Height Hand.Span Sex Handedness
3 Brown 147 167 2 NA
4 Green 157 180 2 L
11 Brown 163 223 1 R
15 Green 160 190 2 NA

• Other relational operators you could use are > (greater than), >= (greater
than or equal to) and <= (less than or equal to).


• You can combine two or more conditions using the AND operator (denoted
&) and the OR operator (denoted |). When two criteria are joined with the
AND operator, R selects only those observations that meet both conditions.
When they are joined with the OR operator, R selects the observations that
meet either one of the conditions, or both.
• For example, to select observations where Eye.Color is Brown and Height is
less than 165, use the command:
> subset(people, Eye.Color=="Brown" & Height<165)
Subject Eye.Color Height Hand.Span Sex Handedness
3 Brown 147 167 2 NA
11 Brown 163 223 1 R
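
The selection criteria described in this section can be compared side by side on a small data frame. The sketch below uses invented values and hypothetical object names (brown, short_brown, and so on); it is meant only to show how each condition changes which rows are kept.

```r
# Hypothetical data, invented for illustration
people <- data.frame(
  Subject   = 1:6,
  Eye.Color = c("Brown", "Green", "Brown", "Green", "Blue", "Brown"),
  Height    = c(186, 182, 147, 157, 169, 163)
)

brown       <- subset(people, Eye.Color == "Brown")                 # equal to one value
brown_green <- subset(people, Eye.Color %in% c("Brown", "Green"))   # any of several values
not_blue    <- subset(people, Eye.Color != "Blue")                  # exclude a value
short       <- subset(people, Height < 165)                         # relational operator
short_brown <- subset(people, Eye.Color == "Brown" & Height < 165)  # both conditions (AND)
either      <- subset(people, Eye.Color == "Blue" | Height < 150)   # at least one (OR)

short_brown$Subject   # 3 6
either$Subject        # 3 5
```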


• As well as selecting a subset of observations from the dataset, you can also
use the select argument to select which variables to keep.
> subset(people, Height<165, select=c(Hand.Span, Height))

Hand.Span Height
167 147
180 157
223 163
190 160

• Another way to subset a dataset is using bracket notation. For example, this
command selects only those people with brown eyes:
> people[people$Eye.Color=="Brown",]
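
The two approaches usually return the same rows, but they treat missing values differently, which is worth knowing before switching between them. The sketch below uses invented data with one missing eye color: subset drops rows where the condition is NA, whereas bracket notation returns a row of NAs for them unless the condition is wrapped in which.

```r
# Hypothetical data with one missing eye color, invented for illustration
people <- data.frame(
  Eye.Color = c("Brown", NA, "Blue", "Brown"),
  Height    = c(186, 160, 169, 163),
  Hand.Span = c(210, 190, 190, 223)
)

# subset: rows where the condition is NA are dropped; select picks the columns
by_subset <- subset(people, Eye.Color == "Brown", select = c(Hand.Span, Height))

# bracket notation: an NA in the condition yields a row filled with NA
by_bracket <- people[people$Eye.Color == "Brown", c("Hand.Span", "Height")]

# which() keeps only the positions where the condition is TRUE
by_which <- people[which(people$Eye.Color == "Brown"), c("Hand.Span", "Height")]

nrow(by_subset)    # 2
nrow(by_bracket)   # 3
nrow(by_which)     # 2
```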


2. Selecting a Random Sample from a Dataset


• To select a random sample of observations from a dataset, use the sample
function. For example, the following command selects a random sample of
50 observations from a dataset named dataset and saves them to a new
dataset named sampledata:
> sampledata <- dataset[sample(1:nrow(dataset), 50),]

• By default, the sample function samples without replacement, so that no
observation can be selected more than once. For this reason, the sample
size cannot exceed the number of observations in the dataset. To
sample with replacement, set the replace argument to T:
> sampledata <- dataset[sample(1:nrow(dataset), 50, replace=T),]
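
A random sample changes every time the command is run. If the same sample is needed again later, one common approach is to call set.seed first. The sketch below assumes a made-up dataset of 100 rows; the seed value and the object name boot are arbitrary choices for illustration.

```r
# Fix the random number generator so the same rows are drawn on every run
set.seed(42)

# Hypothetical dataset of 100 observations, invented for illustration
dataset <- data.frame(id = 1:100, value = rnorm(100))

# Without replacement: 50 distinct rows
sampledata <- dataset[sample(nrow(dataset), 50), ]
anyDuplicated(sampledata$id)   # 0, no row was selected twice

# With replacement: the sample may be larger than the dataset
boot <- dataset[sample(nrow(dataset), 150, replace = TRUE), ]
nrow(boot)                     # 150
```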


Sorting a Dataset
• You can use the order function to sort a dataset. For example, to sort
the people dataset by the Hand.Span variable, use the command:
> people <- people[order(people$Hand.Span),]

• To sort in decreasing instead of ascending order, set the decreasing
argument to T:
> people <- people[order(people$Hand.Span, decreasing=T),]

• You can also sort by more than one variable. To sort the dataset first
by Sex and then by Height, use the command:
> people <- people[order(people$Sex, people$Height),]
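
The sorting commands can be checked on a small example. The sketch below uses invented data and stores each result under a new, hypothetical name instead of overwriting people. The last line shows one way to mix directions, assuming the second sort key is numeric so that it can be negated.

```r
# Hypothetical data, invented for illustration
people <- data.frame(
  Sex       = c("Male", "Female", "Male", "Female"),
  Height    = c(186, 157, 170, 147),
  Hand.Span = c(210, 180, 193, 167)
)

by_span       <- people[order(people$Hand.Span), ]                     # ascending
by_span_desc  <- people[order(people$Hand.Span, decreasing = TRUE), ]  # descending
by_sex_height <- people[order(people$Sex, people$Height), ]            # Sex, then Height

# Ascending Sex, descending Height: negate the numeric key
mixed <- people[order(people$Sex, -people$Height), ]

by_sex_height$Height   # 147 157 170 186
mixed$Height           # 157 147 186 170
```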

Combining and Restructuring Datasets
Appending Rows
• The rbind function allows you to attach one dataset on to the bottom of
the other, which is known as appending or concatenating the datasets.
Example: Combine the two datasets CIAdata1 and CIAdata2

CIAdata1
   Country LifeExp urban pcGDP
1  Finland   79.41    85 36700
2 Slovakia   76.03    55 23600
3       UK   80.17    80 36600
4   Ukrain   68.17    69  7300
5    Spain   81.27    77 31000

CIAdata2
   Country pcGDP LifeExp urban
1    Italy 30900   81.86    68
2  Croatia 18400   75.99    58
3 Slovakia 23600   76.03    55


• Before using the rbind function, make sure that each dataset contains the
same number of variables and that all of the variable names match.
• The variables do not need to be arranged in the same order within the
datasets, as the rbind function automatically matches them by name.
• Once the datasets are prepared, append them with the rbind function, as
shown here for the CIAdata1 and CIAdata2 datasets:
> CIAdata <- rbind(CIAdata1, CIAdata2)
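
To see the name matching in action, the two example datasets can be typed in directly from the tables shown earlier and then appended. This is a sketch: building the datasets with data.frame is an assumption made here so that the example runs on its own, and the final unique call is an optional extra step, not part of the slide's procedure.

```r
# The two example datasets, entered by hand from the tables above
CIAdata1 <- data.frame(
  Country = c("Finland", "Slovakia", "UK", "Ukrain", "Spain"),
  LifeExp = c(79.41, 76.03, 80.17, 68.17, 81.27),
  urban   = c(85, 55, 80, 69, 77),
  pcGDP   = c(36700, 23600, 36600, 7300, 31000)
)
CIAdata2 <- data.frame(
  Country = c("Italy", "Croatia", "Slovakia"),
  pcGDP   = c(30900, 18400, 23600),   # note the different column order
  LifeExp = c(81.86, 75.99, 76.03),
  urban   = c(68, 58, 55)
)

# Check that both datasets contain the same variable names before appending
stopifnot(setequal(names(CIAdata1), names(CIAdata2)))

CIAdata <- rbind(CIAdata1, CIAdata2)
names(CIAdata)   # "Country" "LifeExp" "urban" "pcGDP", the order of the first dataset
nrow(CIAdata)    # 8

# rbind keeps repeated rows (Slovakia appears in both datasets);
# unique() removes exact duplicates if they are not wanted
CIAdata_unique <- unique(CIAdata)
nrow(CIAdata_unique)   # 7
```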

Appending Columns
• The cbind function pastes one dataset on to the side of another.
• This is useful if the data from corresponding rows of each dataset belong to
the same observation, as is the case for the CIAdata1 and WHOdata
datasets:

CIAdata1
   Country LifeExp urban pcGDP
1  Finland   79.41    85 36700
2 Slovakia   76.03    55 23600
3       UK   80.17    80 36600
4   Ukrain   68.17    69  7300
5    Spain   81.27    77 31000

WHOdata
  alcohol mortality
1   12.52        91
2   13.33       130
3   13.37        77
4   15.60       274
5   11.62        68


• You can only use the cbind function to combine datasets that have the
same number of rows.
• This command combines the CIAdata1 and WHOdata datasets to create a
new dataset called CIAWHOdata:
> CIAWHOdata <- cbind(CIAdata1, WHOdata)
> CIAWHOdata
   Country LifeExp urban pcGDP alcohol mortality
1  Finland   79.41    85 36700   12.52        91
2 Slovakia   76.03    55 23600   13.33       130
3       UK   80.17    80 36600   13.37        77
4   Ukrain   68.17    69  7300   15.60       274
5    Spain   81.27    77 31000   11.62        68
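
Because cbind pairs rows purely by position, it can help to confirm that the row counts agree before combining. The sketch below enters the two example datasets by hand (an assumption made so that the example is self-contained) and adds that check.

```r
# The two example datasets, entered by hand from the tables above
CIAdata1 <- data.frame(
  Country = c("Finland", "Slovakia", "UK", "Ukrain", "Spain"),
  LifeExp = c(79.41, 76.03, 80.17, 68.17, 81.27),
  urban   = c(85, 55, 80, 69, 77),
  pcGDP   = c(36700, 23600, 36600, 7300, 31000)
)
WHOdata <- data.frame(
  alcohol   = c(12.52, 13.33, 13.37, 15.60, 11.62),
  mortality = c(91, 130, 77, 274, 68)
)

# cbind matches rows by position only, so the row counts must agree
stopifnot(nrow(CIAdata1) == nrow(WHOdata))

CIAWHOdata <- cbind(CIAdata1, WHOdata)
dim(CIAWHOdata)   # 5 rows, 6 columns
```

If the rows of the two datasets are not in the same order, cbind will still run but will pair the wrong observations, so matching on a common variable with merge is the safer choice in that situation.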


Merging Datasets by Common Variables


• The merge function allows you to combine two datasets by matching the
observations according to the values of common variables.
• Consider the CIAdata1 and CPIdata datasets. The datasets have a common
variable called Country, which can be used to match corresponding
observations.

CIAdata1
   Country LifeExp urban pcGDP
1  Finland   79.41    85 36700
2 Slovakia   76.03    55 23600
3       UK   80.17    80 36600
4   Ukrain   68.17    69  7300
5    Spain   81.27    77 31000

CPIdata
  Country    CPI
1   Spain  80.24
2      UK 100.13
3 Croatia  67.54
4   Italy  94.82
5  Ukrain  51.10
6 Finland  99.69
7   Spain  80.24

• The following command shows how you would combine the CIAdata1 and
CPIdata datasets:
> CIACPIdata <- merge(CIAdata1, CPIdata)
> CIACPIdata
  Country LifeExp urban pcGDP    CPI
1 Finland   79.41    85 36700  99.69
2   Spain   81.27    77 31000  80.24
3   Spain   81.27    77 31000  80.24
4      UK   80.17    80 36600 100.13
5  Ukrain   68.17    69  7300  51.10

• The merge function identifies variables with the same name and uses them
to match up the observations.

• When you combine two datasets with the merge function, R automatically
excludes any unmatched observations that appear in only one of the datasets.
• The all, all.x, and all.y arguments allow you to control how R deals with any
unmatched observations. To keep all unmatched observations, set the all
argument to T:
> allCIACPIdata <- merge(CIAdata1, CPIdata, all=T)
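
The effect of the all, all.x, and all.y arguments is easiest to see by counting rows. The sketch below enters the two example datasets by hand (an assumption made so that the example is self-contained); the object names inner, left, and full are hypothetical. Spain is listed twice in CPIdata, so it is matched twice in every result.

```r
# The two example datasets, entered by hand from the tables above
CIAdata1 <- data.frame(
  Country = c("Finland", "Slovakia", "UK", "Ukrain", "Spain"),
  LifeExp = c(79.41, 76.03, 80.17, 68.17, 81.27),
  urban   = c(85, 55, 80, 69, 77),
  pcGDP   = c(36700, 23600, 36600, 7300, 31000)
)
CPIdata <- data.frame(
  Country = c("Spain", "UK", "Croatia", "Italy", "Ukrain", "Finland", "Spain"),
  CPI     = c(80.24, 100.13, 67.54, 94.82, 51.10, 99.69, 80.24)
)

# Default: keep only countries found in both datasets
inner <- merge(CIAdata1, CPIdata)
nrow(inner)   # 5 (Slovakia, Croatia and Italy are dropped)

# all.x = TRUE: also keep unmatched rows of the first dataset
left <- merge(CIAdata1, CPIdata, all.x = TRUE)
nrow(left)    # 6 (Slovakia is kept, with CPI set to NA)

# all = TRUE: keep unmatched rows from both datasets
full <- merge(CIAdata1, CPIdata, all = TRUE)
nrow(full)    # 8 (Croatia and Italy are added, with NA in the CIA variables)

# Naming the matching variable explicitly with by gives the same result here
explicit <- merge(CIAdata1, CPIdata, by = "Country")
```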
