An Introduction To R Programming
Aziz Nanthaamornphong
The information provided within this book is for general informational purposes only. While I try
to keep the information up-to-date and correct, there are no representations or warranties, express
or implied, about the completeness, accuracy, reliability, suitability or availability with respect to
the information, products, services, or related graphics contained in this book for any purpose.
Any use of this information is at your own risk.
Contents
1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1 R Programming 7
1.2 R vs. Python 8
1.3 Why Use R? 8
1.4 Skills of Data Science 8
1.5 The Data Science Process 8
1.6 Image Processing with R 10
1.7 Data Visualization with R 11
1.8 About R 11
1.8.1 R Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8.2 R Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8.3 R Session . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8.4 R Warning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.8.5 My First Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8.6 R Workspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8.7 R Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8.8 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Foundation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 R Installation Setup 15
2.2 Arithmetic with R 15
2.3 Getting Help with R 16
2.4 Comments 16
2.5 Print 17
2.6 Formatting 17
2.7 Variables 17
2.8 Built-in Constants 19
2.9 R Data Types 19
2.10 Checking Data Type Classes 20
2.11 Vector Basics 20
2.12 Multiple Indexing 22
3 Exercises : R Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Matrix 29
4.2 Dataframe Basics 35
4.3 R Lists Basics 40
6 Control Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.1 Logical Operators 45
6.2 Use Case Example 46
6.3 for loops 48
6.4 while loops 50
8 Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.1 Function Structure 55
9 Exercises : Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
10 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10.1 Common Statistics Methods 61
10.2 R Commander 61
10.3 Correlation 63
10.3.1 Pearson correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.4 Regression 64
10.4.1 Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
10.5 Comparing 2 means (t-test) 65
10.5.1 The Independent t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
10.5.2 The dependent t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
10.6 Analysis of Variance (ANOVA) 67
10.6.1 One-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
10.6.2 Post Hoc Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
10.6.3 2-Way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
10.6.4 Factorial ANOVA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
11 Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
11.1 Data Visualization 69
11.1.1 ggplot2 Grammar of Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
11.1.2 Layers for building Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
11.1.3 Case Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
11.1.4 Scatterplots with ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
11.1.5 Barplots with ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
11.1.6 Boxplots with ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
11.1.7 Coordinates and Faceting with ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
11.1.8 Setting x and y limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
11.1.9 Aspect Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
11.1.10 Facets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
11.1.11 Themes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
1.1 R Programming
Source: https://spectrum.ieee.org/computing/software/the-2017-top-programming-languages
Source: https://blog.dominodatalab.com/comparing-python-and-r-for-data-science
“Spending 100 hours on R will yield vastly better results than spending 10 hours on 10 different
tools.”
The data science process moves through stages such as Data Understanding, Data Preparation, Analytical Modelling, and Evaluation.
1.8 About R
• R is an open-source statistical programming language and environment for statistical computing and graphics
• R supports user-defined functions and is capable of run-time calls to C, C++, and Fortran code
• Available for Windows, Mac, or Linux
• Developed by Ross Ihaka and Robert Gentleman at the University of Auckland in 1995
• The capability of R can be extended by packages (more than 1,300 of them)
• R looks and feels the same regardless of the underlying operating system (for the most part)
Rather than setting up a complete analysis at once, the process is highly interactive: you run a command, take the results and process them through another command, take those results and process them through another command. The cycle may include transforming the data and looping back through the whole process again. You stop when you feel that you have fully analyzed the data.
1.8.1 R Overview
• You can enter commands one at a time at the command prompt (>) or run a set of commands
from a source file
• There is a wide variety of data types, including vectors (numerical, character, logical), ma-
trices, dataframes, and lists
• To quit R, use > q()
• Most functionality is provided through built-in and user-created functions and all data ob-
jects are kept in memory during an interactive session
• Basic functions are available by default. Other functions are contained in packages that can
be attached to a current session as needed
• A key skill to using R effectively is learning how to use the built-in help system. Other sections describe the working environment, inputting programs and outputting results, installing new functionality through packages, etc.
• A fundamental design feature of R is that the output from most functions can be used as
input to other functions
1.8.2 R Interface
• Start the R system; the main window (RGui) with a sub-window (R Console) will appear
• In the ‘Console’ window the cursor is waiting for you to type in some R commands
1.8.3 R Session
1.8.4 R Warning
■ Example 1.1 FOO, Foo, and foo are three different objects ■
1.8.6 R Workspace
• Objects that you create during an R session are held in memory; the collection of objects that you currently have is called the workspace
• The workspace is not saved on disk unless you tell R to do so
• Your objects are lost if you close R without saving them, or worse, if R or your system crashes on you during a session
• If you have saved a workspace image and you start R the next time, it will restore the
workspace. So all your previously saved objects are available again
• Commands are entered interactively at the R user prompt. Up and down arrow keys scroll
through your command history
1.8.7 R Packages
• One of the strengths of R is that the system can easily be extended
• The system allows you to write new functions and package those functions in a so called ‘R
package’ (or ‘R library’)
• The R package may also contain other R objects, for example data sets or documentation
• There is a lively R user community and many R packages have been written and made
available on CRAN for other users
• When you start R, not all of the downloaded packages are attached; only seven packages are attached to the system by default
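For example, a package is installed from CRAN once and then attached in each session where it is needed (the package name here is just an illustration):

install.packages("ggplot2")   # download the package from CRAN (done once)
library(ggplot2)              # attach it to the current session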
1.8.8 Datasets
R comes with a number of sample datasets that you can experiment with.
Exercise 1.2
1 > data ()
2 # to see the available datasets . The results will depend on which
packages you have loaded . Type
3 > help ( datasetname )
4 # for details on a sample dataset
■
2. Foundation
Subtraction
1 >5 -3
2 2
Division
1 >1 / 2
2 0.5
Exponents
1 >2^3
2 8
3 > 2 ** 3
4 8
Modulo
1 >5 %% 2
2 1
2.4 Comments
Comments are just everything that follows #. From a # to the end of the line, the R parser just
skips the text.
1 # This is a comment .
Exercise 2.1
1 help ( vector )
2.5 Print
We can use the print() function to print out variables or strings:
1 print ( " hello " )
2 [1] " hello "
1 x <- 10
2 print ( x )
3 [1] 10
Exercise 2.2
1 print ( mtcars )
2.6 Formatting
We can format strings and variables together for printing in a few different ways:
paste()
The paste() function looks like this: paste (..., sep = " ")
Where ... are the things you want to paste and sep is the separator you want between the pasted
items, by default it is a space. For example:
1 print ( paste ( ’ hello ’ , ’ world ’) )
2 [1] " hello world "
paste0()
paste0(..., collapse) is equivalent to paste(..., sep = "", collapse), slightly more efficiently.
1 paste0 ( ’ hello ’ , ’ world ’)
2 ’ helloworld ’
sprintf
sprintf() is a wrapper for the C function sprintf that returns a character vector containing a formatted combination of text and variable values. This means you can use % codes to place variables in the string by specifying all of them at the end. This is best shown through an example:
1 sprintf ( " % s is % f feet tall \ n " , " Sven " , 7.1)
2 ’ Sven is 7.100000 feet tall ’
2.7 Variables
Rules for writing Identifiers in R
1. Identifiers can be a combination of letters, digits, period (.) and underscore (_)
2. It must start with a letter or a period. If it starts with a period, it cannot be followed by a
digit
3. Reserved words in R cannot be used as identifiers
Valid identifiers in R
■ Example 2.1 total, Sum, .fine.with.dot, this_is_acceptable, Number5 ■
Invalid identifiers in R
■ Example 2.2 tot@l, 5um, _fine, TRUE, .0ne ■
You can use the <- operator to assign a variable; note how it kind of looks like an arrow pointing from the object to the variable name.
1 # Use hashtags for comments
2 variable . name <- 100
You can assign with arrows in both directions, so you could also write the following:
1 2 -> x
An assignment won’t print anything if you write it into the R terminal, but you can get R to
print it just by putting the assignment in parentheses.
1 ( y <- " visible " )
2 [1] " visible "
1 x <- 1
2 y <- 3
3 z <- 4
4 x*y*z
5 12
6 x*Y*z
7 # # Error in eval ( expr , envir , enclos ) : object ’Y ’ not found
Integer - Natural (whole) numbers are known as integers and are also part of the numeric class
1 i <- 5
When you write numbers like 4 and 3, they are interpreted as floating-point numbers. To
explicitly get an integer, you must write 4L and 3L.
1 > class (4)
2 " numeric "
3 > class (4 L )
4 " integer "
Logical - Boolean values (True and False) are part of the logical class. In R these are written in
All Caps.
1 t <- TRUE
2 f <- FALSE
Characters - Text/string values are known as characters in R. You use quotation marks to create
a text character string:
1 char <- " Hello World ! "
2 char
3 ’ Hello World ! ’
We can create a vector by using the combine function c(). To use the function, we pass in the elements we want in the vector, with each individual element separated by a comma.
1 # Using c () to create a vector of numeric elements
2 > nvec <- c (1 ,2 ,3 ,4 ,5)
3 > class ( nvec )
4 ’ numeric ’
1 # Vector of characters
2 > cvec <- c ( ’U ’ , ’S ’ , ’A ’)
3 > class ( cvec )
4 ’ character ’
Note that we CANNOT mix data types of the elements in a vector; R will coerce the other elements so that everything in the vector is of the same data type.
Here's a quick example of what happens with vectors given different data types:
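(A small illustrative sketch; the particular values are made up for the example.)

mixed <- c(1, "two", TRUE)
mixed
# [1] "1"    "two"  "TRUE"
class(mixed)
# [1] "character"    everything has been coerced to character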
Vector Names
We can use the names() function to assign names to each element in our vector. For example,
imagine the following vector of a week of temperatures:
1 > temps <- c (72 ,71 ,68 ,73 ,69 ,75 ,71)
2 > temps
3 72 71 68 73 69 75 71
We know we have 7 temperatures for 7 weekdays, but which temperature corresponds to which
weekday? Does it start on Monday, Sunday, or another day of the week? This is where names()
can be assigned in the following manner:
1 > names ( temps ) <- c ( ’ Mon ’ , ’ Tue ’ , ’ Wed ’ , ’ Thu ’ , ’ Fri ’ , ’ Sat ’ , ’ Sun ’)
We also don't have to rewrite the names vector over and over again; we can simply use a variable as the names() assignment, for example:
1 > days <- c ( ’ Mon ’ , ’ Tue ’ , ’ Wed ’ , ’ Thu ’ , ’ Fri ’ , ’ Sat ’ , ’ Sun ’)
2 > temps2 <- c (1 ,2 ,3 ,4 ,5 ,6 ,7)
3 > names ( temps2 ) <- days
4 > temps2
5 Mon 1
6 Tue 2
7 Wed 3
8 Thu 4
9 Fri 5
10 Sat 6
11 Sun 7
Indexing works by using brackets and passing the index position of the element as a number.
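For example, using the temps vector defined above:

temps[1]   # first element  (Mon: 72)
temps[3]   # third element  (Wed: 68)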
Slicing
You can use a colon (:) to indicate a slice of a vector. The format is: vector[start_index:stop_index]
and you will get that “slice” of the vector returned to you. For example:
1 >v <- c (1 ,2 ,3 ,4 ,5 ,6 ,7 ,8 ,9 ,10)
2 >v [2:4]
3 2 3 4
4 >v [7:10]
5 7 8 9 10
Notice how the elements at both the starting index and the stopping index are included.
Indexing with Names
We’ve previously seen how we can assign names to the elements in a vector, for example:
1 >v <- c (1 ,2 ,3 ,4)
2 > names ( v ) <- c ( ’a ’ , ’b ’ , ’c ’ , ’d ’)
We can use those names along with the indexing brackets to grab individual elements from the
array.
1 >v [ ’a ’]
2 a: 1
1 >v [v >2]
2 c 3
3 d 4
Let’s break this down to see how it works, we first get the vector v>2:
1 >v >2
2 a FALSE
3 b FALSE
4 c TRUE
5 d TRUE
Now we basically pass this vector of logicals through the brackets of the vector and only return
true values at the matching index positions:
1 >v [v >2]
2 c 3
3 d 4
We could also assign names to these logical vectors and pass them as well, for example:
1 > filter <- v >2
2 > filter
3 a FALSE
4 b FALSE
5 c TRUE
6 d TRUE
1 >v [ filter ]
2 c 3
3 d 4
Comparison Operators
In R we can use comparison operators to compare variables and return logical values. Let’s see
some relatively self-explanatory examples:
Greater Than
1 5 > 6
2 FALSE
3 6 > 5
4 TRUE
Be very careful with comparison operators and negative numbers! Use spacing to keep things clear; without a space, a comparison such as var < -2 is easily mistyped as the assignment var <- 2. An example of a dangerous situation:
1 var <- 1
2 var
3 [1] 1
4 var <- 2   # looks like the comparison var < -2 , but actually reassigns var
5 var
6 [1] 2
Not Equal
1 5 != 2
2 TRUE
3 5 != 5
4 FALSE
Equal
1 5 == 5
2 TRUE
3 2 == 3
4 FALSE
Vector Comparisons
We can apply a comparison of a single number to an entire vector, for example:
1 v <- c (1 ,2 ,3 ,4 ,5)
2 v < 2
3 TRUE FALSE FALSE FALSE FALSE
1 v == 3
2 FALSE FALSE TRUE FALSE FALSE
Adding Vectors
1 > v1 <- c(1,2,3)
2 > v2 <- c(5,6,7)
3 > v1 + v2
4 6 8 10
Subtracting Vectors
1 >v1 - v1
2 0 0 0
3 >v1 - v2
4 -4 -4 -4
Multiplying Vectors
1 > v1 * v2
2 5 12 21
Dividing Vectors
1 > v1 / v2
2 0.2 0.3333333 0.4285714
We can also check for things like the standard deviation, variance, maximum element, minimum
element, product of elements:
1 v <- c (12 ,45 ,100 ,2)
2 # Standard Deviation
3 > sd ( v )
4 44.1691823182937
1 # Variance
2 > var ( v )
3 1950.91666666667
1 # Maximum Element
2 > max ( v )
3 100
1 # Minimum Element
2 > min ( v )
3 2
1 # Product of elements
2 > prod ( v1 )
3 6
4 > prod ( v2 )
5 210
2. Create a vector called stock.prices with the following data points: 23, 27, 23, 21, 34
3. Assign names to the price data points relating to the day of the week, starting with Mon,
Tue, Wed, etc...
4. What was the average (mean) stock price for the week? (You may need to reference a built-in
function)
5. Create a vector called over.23 consisting of logicals that correspond to the days where the
stock price was more than $23
6. Use the over.23 vector to filter out the stock.prices vector and only return the day and prices
where the price was over $23
7. Use a built-in function to find the day the price was the highest
4. Data Structure
4.1 Matrix
A matrix will allow us to have a 2-dimensional data structure which contains elements consisting
of the same data type.
A quick tip: you can use the colon notation from slicing to quickly create sequential numeric vectors.
Exercise 4.1
1 >1:10
2 1 2 3 4 5 6 7 8 9 10
3 > v <- 1:10
4 1 2 3 4 5 6 7 8 9 10
To create a matrix, we use the matrix() function. We can pass in a vector into the matrix:
1 matrix ( v )
2 [ ,1]
3 [1 ,] 1
4 [2 ,] 2
5 [3 ,] 3
6 [4 ,] 4
7 [5 ,] 5
8 [6 ,] 6
9 [7 ,] 7
10 [8 ,] 8
11 [9 ,] 9
12 [10 ,] 10
Here we have a two-dimensional matrix which is 10 rows by 1 column. Now what if we want to
specify the number of rows?
We can pass the parameter/argument into the matrix function called nrow which stands for number
of rows:
1 matrix (v , nrow =2)
2 [ ,1] [ ,2] [ ,3] [ ,4] [ ,5]
3 [1 ,] 1 3 5 7 9
4 [2 ,] 2 4 6 8 10
The byrow argument allows you to specify whether or not you want to fill out the matrix by rows
or by columns. For example:
1 matrix (1:12 , byrow = FALSE , nrow =4)
2 [ ,1] [ ,2] [ ,3]
3 [1 ,] 1 5 9
4 [2 ,] 2 6 10
5 [3 ,] 3 7 11
6 [4 ,] 4 8 12
Naming Matrices
It would be nice to name the rows and columns for reference. We can do this similarly to the names() function for vectors, but in this case we define colnames() and rownames(). So let's name our stock matrix (the code that builds stock.matrix from the GOOG and MSFT price vectors is shown in the Matrix Operations part below):
1 days <- c ( ’ Mon ’ , ’ Tue ’ , ’ Wed ’ , ’ Thu ’ , ’ Fri ’)
2 st . names <- c ( ’ GOOG ’ , ’ MSFT ’)
3 colnames ( stock . matrix ) <- days
4 rownames ( stock . matrix ) <- st . names
5 stock . matrix
6
7 Mon Tue Wed Thu Fri
8 GOOG 450 451 452 445 468
9 MSFT 230 231 232 236 228
Matrix Arithmetic
We can perform element by element mathematical operations on a matrix with a scalar (single
number) just like we could with vectors. Let’s see some quick examples:
1 > mat <- matrix(1:6, byrow=TRUE, nrow=2)
2 > mat
3
4      [,1] [,2] [,3]
5 [1,]    1    2    3
6 [2,]    4    5    6
1 # Multiplication
2 2 * mat
3
4 [ ,1] [ ,2] [ ,3]
5 [1 ,] 2 4 6
6 [2 ,] 8 10 12
1 1 / mat
2
3 [ ,1] [ ,2] [ ,3]
4 [1 ,] 1.00 0.5 0.3333333
5 [2 ,] 0.25 0.2 0.1666667
1 mat / 2
2
3 [ ,1] [ ,2] [ ,3]
4 [1 ,] 0.5 1.0 1.5
5 [2 ,] 2.0 2.5 3.0
1 mat ^ 2
2
3 [ ,1] [ ,2] [ ,3]
4 [1 ,] 1 4 9
5 [2 ,] 16 25 36
1 mat / mat
2
3 [ ,1] [ ,2] [ ,3]
4 [1 ,] 1 1 1
5 [2 ,] 1 1 1
1 mat ^ mat
2
3 [ ,1] [ ,2] [ ,3]
4 [1 ,] 1 4 27
5 [2 ,] 256 3125 46656
1 mat * mat
2
3 [ ,1] [ ,2] [ ,3]
4 [1 ,] 1 4 9
5 [2 ,] 16 25 36
Matrix multiplication
1 mat2 <- matrix (1:9 , nrow =3)
2 mat2 %*% mat2
3
Matrix Operations
Run the following code to create the stock.matrix from earlier
1 # Prices
2 goog <- c (450 ,451 ,452 ,445 ,468)
3 msft <- c (230 ,231 ,232 ,236 ,228)
4
5 # Put vectors into matrix
6 stocks <- c ( goog , msft )
7 stock . matrix <- matrix ( stocks , byrow = TRUE , nrow =2)
8
9 # Name matrix
10 days <- c ( ’ Mon ’ , ’ Tue ’ , ’ Wed ’ , ’ Thu ’ , ’ Fri ’)
11 st . names <- c ( ’ GOOG ’ , ’ MSFT ’)
12 colnames ( stock . matrix ) <- days
13 rownames ( stock . matrix ) <- st . names
1 # Display
2 stock . matrix
3
4 Mon Tue Wed Thu Fri
5 GOOG 450 451 452 445 468
6 MSFT 230 231 232 236 228
We can perform functions across the columns and rows, such as colSums() and rowSums():
1 colSums ( stock . matrix )
2
3 Mon Tue Wed Thu Fri
4 680 682 684 681 696
5
6 rowSums ( stock . matrix )
7
8 GOOG MSFT
9 2266 1157
4 GOOG MSFT FB
5 453.2 231.4 120.2
We can select parts of a matrix using bracket notation, mat[rows, cols], where index notation (e.g. 1:5) is put in place of the rows or columns. If either rows or columns is left blank, then we are selecting all of the rows or columns.
1 mat <- matrix (1:12 , byrow = TRUE , nrow =3)
2 mat
3
4 [ ,1] [ ,2] [ ,3] [ ,4]
5 [1 ,] 1 2 3 4
6 [2 ,] 5 6 7 8
7 [3 ,] 9 10 11 12
1 # Grab first row
2 mat[1,]
3
4 [1] 1 2 3 4
5
6 # Grab first column
7 mat [ ,1]
8
9 [1] 1 5 9
10
11 # Grab first 2 rows
12 mat [1:2 ,]
13
14 [ ,1] [ ,2] [ ,3] [ ,4]
15 [1 ,] 1 2 3 4
16 [2 ,] 5 6 7 8
We want to convert the animal character vector into information that an algorithm or equation can understand more easily, meaning we want to begin to check how many categories (factor levels) are in our character vector.
1 animal <- c('d','c','d','c','c')
2 factor.ani <- factor(animal)
3 # Will show levels as well on RStudio or R Console
4 factor.ani
5 [1] d c d c c
6 Levels: c d
If you wanted to assign an order while using the factor() function, you can pass in the argument ordered=TRUE and then pass in levels= with a vector of the levels in the order you want them to be in. So for example:
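A sketch of what that looks like; the temperature vector here is an assumption, chosen only so that it matches the summary() output shown below:

temp.vec <- c('cold', 'med', 'cold', 'hot', 'hot', 'cold', 'med')   # assumed values
fact.temp <- factor(temp.vec, ordered = TRUE, levels = c('cold', 'med', 'hot'))
fact.temp
# [1] cold med  cold hot  hot  cold med
# Levels: cold < med < hot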
This information is useful when used along with the summary() function which is an amazingly
convenient function for quickly getting information from a matrix or vector. For example:
1 summary ( fact . temp )
2
3 cold med hot
4 3 2 2
Since some dataframes are really big, we can use the head() and tail() functions to view the first and last 6 rows respectively.
1 states <- state . x77
2 head ( states )
10 # Output to csv
11 write . csv ( df , file = ’ some . file . csv ’)
1 # Column Names
2 colnames ( df )
3 " col . name .1 " " col . name .2 "
4
5 # Row names ( may just return index )
6 rownames ( df )
7 "1" "2" "3" "4" "5" "6" "7" "8" "9" " 10 "
1 # Referencing Cells
2 vec <- df [[5 , 2]] # get cell by [[ row , col ]] num
3 newdf <- df[1:5, 1:2]  # get multiple cells in a new df
4 df [[2 , ’ col . name .1 ’ ]] <- 99999 # reassign a single cell
1 # Referencing Rows
2 rowdf <- df [1 , ]
3
4 # to get a row as a vector , use following
5 vrow <- as . numeric ( as . vector ( df [1 ,]) )
1 # Referencing Columns
2 cars <- mtcars
3 colv1 <- cars $ mpg
4 colv2 <- cars [ , ’ mpg ’]
5 colv3 <- cars [ , 1]
6 colv4 <- cars [[ ’ mpg ’ ]]
1 # Adding Rows
2 df2 <- data . frame ( col . name .1=2000 , col . name .2= ’ new ’ )
3 df2
4
5 # use rbind to bind a new row !
6 dfnew <- rbind ( df , df2 )
1 # Conditional Selection
2 sub1 <- df [ ( df $ col . name .1 > 8 & df $ col1 . times .2 > 10) , ]
3 sub1
4
5 sub2 <- subset ( df , col . name .1 > 8 & col1 . times .2 > 10)
6 sub2
We can use the same bracket notation we used for matrices: df[rows,columns]
1 # Everything from first row
2 df [1 ,]
3
4 # Everything from first column
5 df [ ,1]
6
7 # Grab Friday data
8 df [5 ,]
If you want all the values of a particular column you can use the dollar sign directly after the
dataframe as follows: df.name$column.name
1 df $ rain
2 df $ days
You can also use bracket notation to return a data frame format of the same information:
1 df [ ’ rain ’]
2 df [ ’ days ’]
We can also sort a dataframe by one of its columns using the order() function:
1 sort . temp <- order ( df $ temp )
2 df [ sort . temp ,]
Using list()
We can use the list() to combine all the data structures:
1 li <- list (v ,m , df )
2 li
3
4 [[1]]
5 [1] 1 2 3 4 5
6
7 [[2]]
8 [ ,1] [ ,2] [ ,3] [ ,4] [ ,5]
9 [1 ,] 1 3 5 7 9
10 [2 ,] 2 4 6 8 10
1 [[3]]
2 height weight
3 1 58 115
4 2 59 117
5 3 60 120
6 4 61 123
7 5 62 126
8 6 63 129
9 7 64 132
10 8 65 135
11 9 66 139
12 10 67 142
13 11 68 146
14 12 69 150
15 13 70 154
16 14 71 159
17 15 72 164
The list() assigned numbers to each of the objects in the list, but we can also assign names in the
following manner:
1 li <- list ( sample _ vec = v , sample _ mat = m , sample _ df = df )
2 # Ignore the " error in vapply " , this won ’ t occur in RStudio !
3 li
4
5 $ sample _ vec
6 [1] 1 2 3 4 5
7
8 $ sample _ mat
9 [ ,1] [ ,2] [ ,3] [ ,4] [ ,5]
10 [1 ,] 1 3 5 7 9
11 [2 ,] 2 4 6 8 10
1 $ sample _ df
2 height weight
3 1 58 115
4 2 59 117
5 3 60 120
6 4 61 123
7 5 62 126
8 6 63 129
9 7 64 132
10 8 65 135
11 9 66 139
12 10 67 142
13 11 68 146
14 12 69 150
15 13 70 154
16 14 71 159
17 15 72 164
Combining lists
Lists can hold other lists! You can also combine lists using the combine function c():
1 double _ list <- c ( li , li )
2 str ( double _ list )
3
4 List of 6
5 $ sample _ vec : num [1:5] 1 2 3 4 5
6 $ sample _ mat : int [1:2 , 1:5] 1 2 3 4 5 6 7 8 9 10
7 $ sample _ df : ’ data . frame ’: 15 obs . of 2 variables :
8 .. $ height : num [1:15] 58 59 60 61 62 63 64 65 66 67 ...
9 .. $ weight : num [1:15] 115 117 120 123 126 129 132 135 139 142 ...
10 $ sample _ vec : num [1:5] 1 2 3 4 5
11 $ sample _ mat : int [1:2 , 1:5] 1 2 3 4 5 6 7 8 9 10
12 $ sample _ df : ’ data . frame ’: 15 obs . of 2 variables :
13 .. $ height : num [1:15] 58 59 60 61 62 63 64 65 66 67 ...
14 .. $ weight : num [1:15] 115 117 120 123 126 129 132 135 139 142 ...
5. Exercises : Data Structure
1. Create 2 vectors A and B, where A is (1,2,3) and B is (4,5,6). With these vectors, use the
cbind() or rbind() function to create a 2 by 3 matrix from the vectors. You’ll need to figure out
which of these binding functions is the correct choice.
2. Create a 3 by 3 matrix consisting of the numbers 1-9. Create this matrix using the shortcut
1:9 and by specifying the nrow argument in the matrix() function call. Assign this matrix to the
variable mat
4. Create a 5 by 5 matrix consisting of the numbers 1-25 and assign it to the variable mat2.
The top row should be the numbers 1-5.
5. Using indexing notation, grab a sub-section of mat2 from the previous exercise that looks like
this:
[7,8]
[12,13]
6. Using indexing notation, grab a sub-section of mat2 from the previous exercise that looks
like this:
[19,20]
[24,25]
8. Find out how to use runif() to create a 4 by 5 matrix consisting of 20 random numbers (4*5=20).
9. Set the built-in data frame mtcars as a variable df. We’ll use this df variable for the rest of
the exercises.
11. Select the rows where all cars have 6 cylinders (cyl column)
13. Your performance column will have several decimal place precision. Figure out how to use
round() (check help(round)) to reduce this accuracy to only 2 decimal places.
14. What is the average mpg for cars that have more than 100 hp AND a wt value of more
than 2.5
1 x <- 10
2 x > 5
3 TRUE
We can also add parenthesis for readability and to make sure the order of comparisons is what we
expect:
1 ( x < 20) & (x >5)
2 TRUE
3
4 ( x < 20) & (x >5) & ( x == 10)
5 TRUE
NOT!
You can think about NOT as reversing any logical value in front of it, basically asking, “Is this
NOT true?” For example:
1 (10==1)
2 FALSE
3
4 ! (10==1)
5 TRUE
6
7 # We can stack them ( pretty uncommon , but possible )
8 ! ! (10==1)
9 FALSE
1 df <- mtcars
2 df [ df [ ’ mpg ’] >= 20 ,] # Notice the use of indexing with the comma
3 # subset ( df , mpg >=20) # Could also use subset
Let’s combine filters with logical operators! Let’s grab rows with cars of at least 20mpg and over
100 hp.
1 df [( df [ ’ mpg ’] >= 20) & ( df [ ’ hp ’] > 100) ,]
4 tt || tf
5 TRUE
6
7 tt || ft
8 TRUE
9
10 tt && tf
11 TRUE
We say: if some condition is true, then execute the code inside of the curly brackets.
For example, let's say we have two variables, hot and temp. Imagine that hot starts off as FALSE and temp is some number in degrees. If temp is greater than 80, then we want to assign TRUE to hot. Let's see this in action:
1 hot <- FALSE
2 temp <- 60
3 if ( temp > 80) {
4 hot <- TRUE
5 }
6 hot
7 [1] FALSE
1 # Reset temp
2 temp <- 100
3
4 if ( temp > 80) {
5 hot <- TRUE
6
7 }
8
9 hot
10 [1] TRUE
else if
What if we wanted more options to print out, rather than just the two given by if and else? This is where we can use the else if statement to add multiple condition checks, using else at the end to execute code if none of our conditions match an if or else if.
1 temp <- 30
2
3 if ( temp > 80) {
4 print ( " Hot outside ! " )
5 } else if ( temp <80 & temp >50) {
6 print ( ’ Nice outside ! ’)
7 } else if ( temp <50 & temp > 32) {
8 print ( " Its cooler outside ! " )
9 } else {
10 print ( " Its really cold outside ! " )
11 }
12
13 [1] " Its really cold outside ! "
1 temp <- 75
2
3 if ( temp > 80) {
4 print ( " Hot outside ! " )
5 } else if ( temp <80 & temp >50) {
6 print ( ’ Nice outside ! ’)
7 } else if ( temp <50 & temp > 32) {
8 print ( " Its cooler outside ! " )
9 } else {
10 print ( " Its really cold outside ! " )
11 }
12
13 [1] "Nice outside!"
Final Example
Let’s see a final more elaborate example of if,else, and else if :
1 # Items sold that day
2 ham <- 10
3 cheese <- 10
4
5 # Report to HQ
6 report <- ’ blank ’
7 if ( ham >= 10 & cheese >= 10) {
8 report <- " Strong sales of both items "
9 } else if ( ham == 0 & cheese == 0) {
10 report <- " Nothing sold ! "
11 } else {
12 report <- ’ We had some sales ’
13 }
14 print ( report )
15
16 [1] " Strong sales of both items "
8 [1] 4
9 [1] 5
The other way would be to loop a numbered amount of times and then use indexing to continually
grab from the vector:
1 for ( i in 1: length ( vec ) ) {
2 print ( vec [ i ])
3 }
4 [1] 1
5 [1] 2
6 [1] 3
7 [1] 4
8 [1] 5
1 for ( i in 1: length ( li ) ) {
2 print ( li [[ i ]]) # Remember to use double brackets !
3 }
We can also use a nested for loop to go through every element of a matrix (assuming mat is the 5-by-2 matrix matrix(1:10, nrow=5)):
1 for (row in 1:nrow(mat)) {
2   for (col in 1:ncol(mat)) {
3     print(paste('The element at row:', row, 'and col:', col, 'is', mat[row, col]))
4   }
5 }
6 [1] " The element at row : 1 and col : 1 is 1"
7 [1] " The element at row : 1 and col : 2 is 6"
8 [1] " The element at row : 2 and col : 1 is 2"
9 [1] " The element at row : 2 and col : 2 is 7"
10 [1] " The element at row : 3 and col : 1 is 3"
11 [1] " The element at row : 3 and col : 2 is 8"
12 ......
13 [1] " The element at row : 5 and col : 2 is 10 "
1 x <- 0
2 while ( x < 10) {
3 cat ( ’x is currently : ’ ,x )
4 print ( ’ x is still less than 10 , adding 1 to x ’)
5 # add one to x
6 x <- x +1
7 if ( x ==10) {
8 print ( " x is equal to 10 ! Terminating loop " )
9 }
10 }
break
You can use break to break out of a loop.
1 x <- 0
2 while ( x < 5) {
3 cat ( ’x is : ’ ,x , sep = " " )
4 print ( ’ x is still less than 5 , adding 1 to x ’)
5 # add one to x
6 x <- x +1
7 if ( x ==5) {
8 print ( " x is equal to 5 ! " )
9 print ( " I will also print , woohoo ! " )
10 }
11 }
12 x is :0[1] " x is still less than 5 , adding 1 to x"
13 x is :1[1] " x is still less than 5 , adding 1 to x"
14 x is :2[1] " x is still less than 5 , adding 1 to x"
15 x is :3[1] " x is still less than 5 , adding 1 to x"
16 x is :4[1] " x is still less than 5 , adding 1 to x"
17 [1] " x is equal to 5 ! "
18 [1] " I will also print , woohoo ! "
1 x <- 0
2 while ( x < 5) {
3 cat ( ’x is : ’ ,x , sep = " " )
4 print ( ’ x is less than 5 , adding 1 to x ’)
5 # add one to x
6 x <- x +1
7 if ( x ==5) {
8 print ( " x is equal to 5 ! " )
9 break
10 print ( " I will also print , woohoo ! " )
11 }
12 }
13 x is :0[1] " x is less than 5 , adding 1 to x "
14 x is :1[1] " x is less than 5 , adding 1 to x "
15 x is :2[1] " x is less than 5 , adding 1 to x "
16 x is :3[1] " x is less than 5 , adding 1 to x "
17 x is :4[1] " x is less than 5 , adding 1 to x "
18 [1] " x is equal to 5 ! "
7. Exercises : Control Structure
1. Write a script that will print “Even Number” if the variable x is an even number, otherwise
print “Not Even”.
2. Write a script that will print ’Is a Matrix’ if the variable x is a matrix, otherwise print “Not
a Matrix”. Hint: You may want to check out help(is.matrix)
3. Create a script that given a numeric vector x with a length 3, will print out the elements in
order from high to low.
4. Write a script that uses if, else if, and else statements to print the max element in a numeric
vector with 3 elements.
8. Function
Example
1 hello <- function () {
2 print ( ’ hello ! ’)
3 }
4 hello ()
5 [1] " hello ! "
Default Values
We have had to define every single argument in the function when using it, but we can also have
default values by using an equals sign, for example:
1 hello _ someone <- function ( name = ’ Frankie ’) {
2 print ( paste ( ’ Hello ’ , name ) )
3 }
4 # uses default
5 hello _ someone ()
6 [1] " Hello Frankie "
7
8 # overwrite default
9 hello _ someone ( ’ Sammy ’)
10 [1] " Hello Sammy "
Returning Values
If we wanted to return the results so that we could assign them to a variable, we can use the return
keyword for this task in the following manner:
1 formal <- function ( name = ’ Sam ’ , title = ’ Sir ’) {
2 return ( paste ( title , ’ ’ , name ) )
3 }
4 var <- formal ( ’ Marie Curie ’ , ’ Ms . ’)
5 var
6 [1] " Ms . Marie Curie "
Scope
• Scope is the term we use to describe how objects and variable get defined within R
• If a variable is defined only inside a function, then its scope is limited to that function
1 times5 <- function(input) {
2   result <- input * 5
3   return(result)
4 }
5 result
6 Error : object ’ result ’ not found
7 input
8 Error : object ’ input ’ not found
These errors indicate that these variables are only defined inside the scope of the function.
1 v <- "I ’ m global v "
2 stuff <- "I ’ m global stuff "
3
4 fun <- function(stuff) {
5   print(v)
6   stuff <- 'Reassign stuff inside func'
7   print(stuff)
8 }
9
10 print ( v ) # print v
11 print ( stuff ) # print stuff
12 fun ( stuff ) # pass stuff to function
13 # reassignment only happens in scope of function
14 print ( stuff )
15
16 [1] "I ’ m global v "
17 [1] "I ’ m global stuff "
18 [1] "I ’ m global v "
19 [1] " Reassign stuff inside func "
20 [1] "I ’ m global stuff "
2. Create a function that accepts two arguments, an integer and a vector of integers. It returns
TRUE if the integer is present in the vector, otherwise it returns FALSE. Make sure you pay care-
ful attention to your placement of the return(FALSE) line in your function.
3. Create a function that accepts two arguments, an integer and a vector of integers. It returns
the count of the number of occurrences of the integer in the input vector.
4. We want to ship bars of aluminum. We will create a function that accepts an integer representing the requested kilograms of aluminum for the package to be shipped. To fulfill these orders, we have small bars (1 kilogram each) and big bars (5 kilograms each). Return the least number of bars needed.
For example, a load of 6 kg requires a minimum of two bars (one 5 kg bar and one 1 kg bar). A load of 17 kg requires a minimum of 5 bars (three 5 kg bars and two 1 kg bars).
5. Create a function that accepts 3 integer values and returns their sum. However, if an integer
value is evenly divisible by 3, then it does not count towards the sum. Return zero if all numbers
are evenly divisible by 3. Hint: You may want to use the append() function.
6. Create a function that will return TRUE if an input integer is prime. Otherwise, return FALSE.
You may want to look into the any() function.
10. Statistics
• Correlation
• Linear Regression
• Comparing 2 means
• ANOVA
10.2 R Commander
10.3 Correlation
• It is a way of measuring the extent to which two variables are related
• It measures the pattern of responses across variables
• Correlation and Causality
– In any correlation, causality between two variables cannot be assumed because there
may be other measured or unmeasured variables affecting the results
– Correlation coefficients say nothing about which variable causes the other to change
To compute basic correlation coefficients there are three main functions that can be used:
• cor()
• cor.test()
• rcorr()
• It is an effect size
– ±.1 = small effect
– ±.3 = medium effect
– ±.5 = large effect
1 cor ( examData , use = " complete . obs " , method = " pearson " )
2
3 rcorr ( examData , type = " pearson " )
4
5 cor . test ( examData $ Exam , examData $ Anxiety , method = " pearson " )
10.4 Regression
• A way of predicting the value of one variable from another
– It is a hypothetical model of the relationship between two variables
– The model used is a linear one
The Regression Equation
Ŷ = a + bX (simple regression)
Ŷ = a + b1X1 + b2X2 (multiple regression)
where
Ŷ is the predicted value of the dependent variable,
a is the intercept, and
b (or b1 and b2) is the regression coefficient (slope) for the predictor variable X (or X1 and X2).
Multiregression with R
1 > results = lm ( Y ~ X1 + X2 )
2 > summary ( results )
From the summary() output you can read off the estimated intercept and slope coefficients, which give the fitted prediction equation Ŷ = a + b1X1 + b2X2.
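As a concrete, hedged illustration with a built-in dataset (the variables here are assumptions chosen for demonstration, not the course data):

results <- lm(mpg ~ wt + hp, data = mtcars)
summary(results)
coef(results)   # intercept (a) and slopes (b1, b2) for the prediction equation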
How?
If you have the data for different groups stored in a single column:
1 newModel <-t . test ( outcome ~ predictor , data = dataFrame , paired =
FALSE / TRUE )
2
If you have the data for different groups stored in two columns:
1 newModel <- t.test(scores.group.1, scores.group.2, paired = FALSE/TRUE)
2
3 ind . t . test <-t . test ( spiderWide $ real , spiderWide $ picture )
If we had our data stored in long format so that our group scores are in a single column and group
membership is expressed in a second column:
1 dep . t . test <-t . test ( Anxiety ~ Group , data = spiderLong , paired =
TRUE )
Using aov():
1 viagraModel <- aov ( libido ~ dose , data = viagraData )
2 summary ( viagraModel )
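A common post hoc follow-up on the one-way ANOVA above is Tukey's HSD test; a minimal sketch building on the viagraModel object (dose must be a factor for this to work):

postHocs <- TukeyHSD(viagraModel)   # pairwise comparisons between dose groups
postHocs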
• The idea that you can build every graph from the same few components: a data set, a set of geoms (visual marks that represent data points), and a coordinate system
• To display data values, map variables in the data set to aesthetic properties of the geom like
size, color, and x and y locations
Geoms in ggplot2
1 # import ggplot2
2 library ( ggplot2 )
Stat
A statistical transformation, or stat, transforms the data, typically by summarizing it in some manner.
A stat takes a dataset as input and returns a dataset as output, and so a stat can add new variables
to the original dataset.
1 ggplot ( diamonds , aes ( carat ) ) + geom _ histogram ( aes ( y = .. density ..) ,
binwidth = 0.1)
Stats in ggplot2
Position adjustments
The different types of adjustment are best illustrated with a bar chart.
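A small sketch using the built-in mpg data (an assumption; the book's own figure is not reproduced here) comparing the main position adjustments:

library(ggplot2)
p <- ggplot(mpg, aes(x = class, fill = drv))
p + geom_bar(position = "stack")   # default: bars stacked on top of each other
p + geom_bar(position = "dodge")   # bars placed side by side
p + geom_bar(position = "fill")    # stacked bars normalised to a height of 1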
Using ggplot2
Exercise 11.1 Quick Example with Histograms. We have a couple of options for quickly
producing histograms off the columns of a data frame.
• hist()
• qplot()
• ggplot()
■
[Figure: histogram of df$Home.Value produced with the base hist() function; the y-axis shows Frequency.]
Using qplot
1 library ( ggplot2 )
2 qplot ( df $ Home . Value )
Using ggplot
1 library ( ggplot2movies )
2 df <- movies <- movies [ sample ( nrow ( movies ) , 1000) , ]
3
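The plotting call itself is missing from this excerpt; a minimal sketch consistent with the later examples (which all build on a histogram of the rating column) would be:

pl <- ggplot(df, aes(x = rating))
pl + geom_histogram()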
Adding Color
Adding Labels
Linetypes
We have the options: "blank", "solid", "dashed", "dotted", "dotdash", "longdash", and "twodash".
Advanced Aesthetics
We can add an aes() argument to geom_histogram() for some more advanced features; ggplot also gives you the ability to edit color and fill scales.
1 # Adding Labels
2 pl <- ggplot ( df , aes ( x = rating ) )
3 pl + geom _ histogram ( binwidth =0.1 , aes ( fill =.. count ..) ) + xlab ( ’ Movie
Ratings ’) + ylab ( ’ Occurences ’)
You can further edit this by adding the scale_fill_gradient() function to your ggplot objects:
1 # Adding Labels
2 pl <- ggplot ( df , aes ( x = rating ) )
3 pl2 <- pl + geom _ histogram ( binwidth =0.1 , aes ( fill =.. count ..) ) + xlab
( ’ Movie Ratings ’) + ylab ( ’ Occurences ’)
4
5 # scale _ fill _ gradient ( ’ Label ’ , low = color1 , high = color2 )
6 pl2 + scale _ fill _ gradient ( ’ Count ’ , low = ’ blue ’ , high = ’ red ’)
1 # Adding Labels
2 pl <- ggplot ( df , aes ( x = rating ) )
3 pl + geom _ histogram ( aes ( y =.. density ..) ) + geom _ density ( color = ’ red ’)
Scatter plots allow us to place points that let us see possible correlations between two features of
a data set.
1 library ( ’ ggplot2 ’)
2 df <- mtcars
3
4 pl <- ggplot ( data = df , aes ( x = wt , y = mpg ) )
5 pl + geom _ point ()
1 # With Shapes
1 # Better version
2 # With Shapes
3 pl <- ggplot ( data = df , aes ( x = wt , y = mpg ) )
4 pl + geom _ point ( aes ( shape = factor ( cyl ) , color = factor ( cyl ) ) , size =4 ,
alpha =0.6)
Gradient Scales
1 pl + geom _ point ( aes ( colour = hp ) , size =4) + scale _ colour _ gradient (
high = ’ red ’ , low = " blue " )
• By default, geom_bar uses stat="count", which makes the height of each bar proportional to the number of cases in each group
• If you want the heights of the bars to represent values in the data, use stat=“identity" and
map a variable to the y aesthetic
1 library ( ggplot2 )
2 # counts ( or sums of weights )
3 g <- ggplot ( mpg , aes ( class ) )
4 # Number of cars in each class :
5 g + geom _ bar ()
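For the second case mentioned above (heights taken from values in the data), a hedged sketch with a small made-up data frame:

vals <- data.frame(class = c("compact", "suv", "pickup"), avg_hwy = c(28, 18, 17))
# bar heights come directly from the avg_hwy values
ggplot(vals, aes(x = class, y = avg_hwy)) + geom_bar(stat = "identity")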
1 df <- mtcars
2 pl <- ggplot ( mtcars , aes ( factor ( cyl ) , mpg ) )
3 pl + geom _ boxplot ()
1 pl + geom _ boxplot ( fill = " grey " , color = " blue " )
1 library ( ggplot2 )
2 pl <- ggplot ( mpg , aes ( x = displ , y = hwy ) ) + geom _ point ()
3 pl
A sometimes nicer way to do this is by adding + coord_cartesian() with xlim and ylim arguments, passing in numeric vectors.
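For example (the limits here are illustrative values for the displ/hwy plot above):

pl + coord_cartesian(xlim = c(2, 5), ylim = c(20, 40))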
11.1.10 Facets
The best way to set up a facet grid (multiple plots) is to use facet_grid()
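A minimal sketch, faceting the displ/hwy scatterplot from above by the cyl column:

pl + facet_grid(. ~ cyl)   # one panel per number of cylinders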
11.1.11 Themes
There are a lot of built-in themes in ggplot2 and you can use them in two ways: by calling theme_set() before your plot to set a global theme, or by adding a theme directly to a plot object:
1 my _ plot + theme _ bw ()
There is also a great library called ggthemes which adds even more built-in themes for ggplot.
You can also customize your own themes
Themes elements
1 library ( ggplot2 )
2 df <- mtcars
3 pl <- ggplot ( df , aes ( x = mpg , y = hp ) ) + geom _ point ()
4 print ( pl )
1 pl + theme _ bw ()
1 pl + theme _ classic ()
• We can improve on Bag of Words by adjusting word counts based on their frequency in the corpus as a whole
substr() - returns the substring in the given character range start:stop for the given string
1 substr ( ’ abcdefg ’ , start =2 , stop = 5)
2
3 ’ bcde ’
strsplit() - splits a string into a list of substrings based on another string split in x
1 strsplit ( ’ 2016 -01 -23 ’ , split = ’ - ’)
2
3 ’ 2016 ’ ’ 01 ’ ’ 23 ’
1 library ( rtweet )
2
3 # # get user IDs of accounts followed by BBC
4 bbc _ fds = get _ friends ( " bbc " )
5 # # lookup data on those accounts
6 bbc _ fds _ data = lookup _ users ( bbc _ fds $ user _ id )
7 head ( bbc _ fds _ data )
8 # # get user IDs of accounts following bbc
9 bbc _ flw = get _ followers ( " bbc " , n = 1000)
10 # # lookup data on those accounts
11 bbc _ flw _ data = lookup _ users ( bbc _ flw $ user _ id )
12 head ( bbc _ flw _ data )
13 ## get recent timelines for several accounts
14 tmls = get _ timelines ( c ( " cnn " , " BBCWorld " , " foxnews " ) , n = 3200)
15 head ( tmls )
16 tmls = as . data . frame ( tmls )
17 head ( tmls )
1 library ( rtweet )
2 library ( dplyr )
3 library ( tidytext )
4 library ( ggplot2 )
5
6 climate _ tweets <- search _ tweets ( q = " # climatechange " , n = 1000 ,
lang = " en " , include _ rts = FALSE )
7 head ( climate _ tweets $ text )
8 # remove http elements manually
9 climate _ tweets $ stripped _ text <- gsub ( " http . * " ," " , climate _ tweets $
text )
10 climate _ tweets $ stripped _ text <- gsub ( " https . * " ," " , climate _ tweets $
stripped _ text )
11 # remove punctuation , convert to lowercase , add id for each tweet !
12 climate _ tweets _ clean <- climate _ tweets % >% dplyr :: select ( stripped _
text ) % >% unnest _ tokens ( word , stripped _ text )
13 cc = climate _ tweets _ clean % >% anti _ join ( stop _ words ) # # remove stop
words
14 head ( cc )
15 # plot the top 15 words -- notice any issues ?
16 top = cc %>% count(word, sort = TRUE) %>% top_n(15)
17
18 top % >%
19 ggplot ( aes ( x = word , y = n ) ) +
20 geom _ col () +
21 xlab ( NULL ) +
22 coord _ flip () +
23 labs ( x = " Count " , y = " Unique words " , title = " Count of unique
words found in tweets " ) +
24 theme _ bw ()
Search Group
1 library ( Rfacebook )
2 token = mytoken
3 ids <- searchGroup ( name = " rusers " , token = token )
Search Page
1 # # search pages relating to Thailand
2 sp = searchPages ( " Thailand " , token = token , n =15)
3 View ( sp )
4 head ( post )
7
8 urldata <- getURL ( url ) # get data from this URL
9 data <- readHTMLTable ( urldata , stringsAsFactors = FALSE )
10 # read the HTML table
11
12 # medal tally
13 names ( data )
14 head ( data )
15 x = data $ ‘2016 Summer Olympics medal table ‘
16 head ( x )
1 library ( readr )
2 library ( tm )
3 library ( wordcloud )
4 s = read . csv ( " mugabe1 . csv " )
5 head ( s )
6 names ( s )
7
8 text <- as . character ( s $ text )
9 # # carry out text data cleaning - gsub
10 some _ txt <- gsub ( " ( RT | via ) ((?:\\ b \\ w * @ \\ w +) +) " ," " ,s $ text )
11 some _ txt <- gsub ( " http [^[: blank :]]+ " ," " , some _ txt )
12 some _ txt <- gsub ( " @ \\ w + " ," " , some _ txt )
13 some _ txt <- gsub ( " [[: punct :]] " ," " , some _ txt )
14 some _ txt <- gsub ( " [^[: alnum :]] " ," " , some _ txt )
15 some _ txt = as . character ( some _ txt )
16 library ( syuzhet )
17 tweetSentiment <- get _ nrc _ sentiment ( text )
18 # syuzhet pkg
19 # Calls the NRC sentiment dictionary to calculate the presence of
20 # eight different emotions and their corresponding valence in a text
file .
21
22 barplot ( sort ( colSums ( prop . table ( tweetSentiment [ , 1:8]) ) ) , cex .
names = 0.7 , las = 1 , main = " Emotions in Tweets text " , xlab = "
Percentage " )
1 # # Tidy Sentiments
2 library ( janeaustenr )
3 library ( dplyr )
4 library ( tm )
5 library ( tidytext )
6 library ( tidyverse )
7 library ( qdapTools )
8 library ( ggplot2 )
9
• For example, a piece of equipment could have data points labeled “F” (failed) or “R” (runs)
• The learning algorithm receives a set of inputs along with the corresponding correct outputs,
and the algorithm learns by comparing its actual output with correct outputs to find errors
• Through methods like classification, regression, prediction and gradient boosting, supervised learning uses patterns to predict the values of the label on additional unlabeled data
• Supervised learning is commonly used in applications where historical data predicts likely
future events
• For example, it can anticipate where credit card transactions are likely to be fraudulent or
which insurance customer is likely to file a claim
• Or it can attempt to predict the price of a house based on different features for houses for
which we have historical price data
• These algorithms are also used to segment text topics, recommend items and identify data
outliers
Example
Let's take Shaquille O'Neal as an example. Shaq is really tall: 7 ft 1 in (2.2 meters). If Shaq has a son, chances are he'll be pretty tall too. However, Shaq is such an anomaly that there is also a very good chance that his son will not be as tall as Shaq.
Turns out this is the case: Shaq's son is pretty tall (6 ft 7 in), but not nearly as tall as his dad.
Galton called this phenomenon regression, as in “A father’s son’s height tends to regress (or drift
towards) the mean (average) height."
Let's take the simplest possible example: calculating a regression with only 2 data points.
All we’re trying to do when we calculate our regression line is draw a line that’s as close to every
dot as possible.
For classic linear regression, or “Least Squares Method", you only measure the closeness in
the “up and down" direction.
Now wouldn’t it be great if we could apply this same concept to a graph with more than just two
data points?
Our goal with linear regression is to minimize the vertical distance between all the data
points and our line. So in determining the best line, we are attempting to minimize the distance
between all the points and their distance to our line.
There are lots of different ways to minimize this, (sum of squared errors, sum of absolute
errors, etc), but all these methods have a general goal of minimizing this distance.
Remember that Linear Regression is a supervised learning algorithm, meaning we’ll have labeled
data and try to predict new labels on unlabeled data. We’ll explore some of the following concepts:
• Get our Data
• Exploratory Data Analysis (EDA)
• Clean our Data
• Review of Model Form
• Train and Test Groups
• Linear Regression Model
1 any ( is . na ( df ) )
2 FALSE
Categorical Features
Moving on, let's make sure that categorical variables have a factor set to them. For example, the Mjob column refers to categories of job types, not some numeric value from 1 to 5. R is actually really good at detecting these sorts of values and will take care of this work for you a lot of the time, but always keep in mind the use of factor() as a possible step. Luckily this is basically already done here; we can check it using the str() function:
1 str ( df )
Building a Model
The general model of building a linear regression model in R look like this:
1 model <- lm(y ~ x1 + x2, data)
Model Interpretation
Looks like absences, famrel, G1, and G2 scores are good predictors, with age and activities also possibly contributing to a good model.
Predictions
Let’s test our model by predicting on our testing set:
1 G3 . predictions <- predict ( model , test )
Now we can get the root mean squared error, a standardized measure of how off we were with our
predicted values:
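The calculation is not shown in this excerpt; a minimal sketch, assuming the true grades live in test$G3, would be:

# root mean squared error of the predictions (assumes test$G3 holds the true scores)
sqrt(mean((test$G3 - G3.predictions)^2))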
Now let’s take care of negative predictions! Lot’s of ways to this, here’s a more complicated way,
but its a good example of creating a custom function for a custom problem:
1 to _ zero <- function ( x ) {
2 if ( x < 0) {
3 return (0)
4 } else {
5 return ( x )
6 }
7 }
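It can then be applied across the prediction vector, for example:

G3.predictions <- sapply(G3.predictions, to_zero)   # clip negative predictions to 0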
There’s lots of ways to evaluate the prediction values, for example the MSE (mean squared error):
1 mse <- mean (( results $ real - results $ pred ) ^2)
2 print ( mse )
3 [1] 4.411405
Training Algorithm:
Store all the Data
Prediction Algorithm:
1. Calculate the distance from x to all points in your data
2. Sort the points in your data by increasing distance from x
3. Predict the majority label of the “k” closest points
Pros
• Very simple
• Training is trivial
• Works with any number of classes
• Easy to add more data
• Few parameters
– K
– Distance Metric
Cons
• High Prediction Cost (worse for large data sets)
• Not good with high dimensional data
• Categorical Features don’t work well
■ Example 14.2 K Nearest Neighbors ■
We’ll use the ISLR package to get the data, you can download it with the code below. Remember
to call the library as well.
1 install . packages ( " ISLR " )
2 library ( ISLR )
We will apply the KNN approach to the Caravan data set, which is part of the ISLR library.
This data set includes 85 predictors that measure demographic characteristics for 5,822 individuals.
The response variable is Purchase, which indicates whether or not a given individual purchases a
Caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.
Let’s look at the structure:
1 str ( Caravan )
2 summary ( Caravan $ Purchase )
Cleaning Data
Let’s just remove any NA values by dropping the rows with them.
1 any ( is . na ( Caravan ) )
Standardize Variables
Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large
scale will have a much larger effect on the distance between the observations, and hence on the
KNN classifier, than variables that are on a small scale.
Clearly the scales are different! We are now going to standardize all the X variables except Y (Purchase). The Purchase variable is in column 86 of our dataset, so let's save it in a separate variable because the knn() function needs it as a separate argument.
1 # save the Purchase column in a separate variable
2 purchase <- Caravan [ ,86]
3
4 # Standarize the dataset using " scale () " R function
5 standardized . Caravan <- scale ( Caravan [ , -86])
We can see that now all independent variables (Xs) have a mean of 0 and a standard deviation of 1. Great, then let's divide our dataset into testing and training data. We'll just do a simple split using the first 1000 rows as a test set:
1 # First 1000 rows for test set
2 test . index <- 1:1000
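The rest of the split and the knn() call are not shown in this excerpt; a sketch consistent with the variable names used below (knn() comes from the class package) would be:

library(class)

test.data <- standardized.Caravan[test.index, ]
train.data <- standardized.Caravan[-test.index, ]
test.purchase <- purchase[test.index]
train.purchase <- purchase[-test.index]

# predict Purchase for the test set using the single nearest neighbour
predicted.purchase <- knn(train.data, test.data, train.purchase, k = 1)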
Now lets evaluate the model we trained and see our misclassification error rate.
1 mean ( test . purchase ! = predicted . purchase )
2 0.116
Choosing K Value
Let’s see what happens when we choose a different K value:
1 predicted . purchase <- knn ( train . data , test . data , train . purchase , k =3)
2 mean ( test . purchase ! = predicted . purchase )
3 0.073
Should we manually change k and see which k gives us the minimal misclassification rate? NO! We have computers, so let's automate the process with a for() loop. A loop in R repeats the same commands as many times as you specify. For example, if we wanted to check k = 1 up to 20 by hand we would have to write roughly 3 × 20 lines of code, but with a for loop you only need a few lines, and you can repeat them as many times as you want. (Note this may take a while because you're running the model 20 times!)
1 predicted . purchase = NULL
2 error . rate = NULL
3
4 for ( i in 1:20) {
5 set . seed (101)
6 predicted . purchase = knn ( train . data , test . data , train . purchase , k = i )
7 error.rate[i] = mean(test.purchase != predicted.purchase)
8 }
Elbow Method
We can plot out the various error rates for the K values. We should see an “elbow” indicating that
we don’t get a decrease in error rate for using a higher K. This is a good cut-off point:
1 library ( ggplot2 )
2 k . values <- 1:20
3 error . df <- data . frame ( error . rate , k . values )
4 ggplot ( error . df , aes ( x = k . values , y = error . rate ) ) + geom _ point () + geom _
line ( lty = " dotted " , color = ’ red ’) + theme _ bw ()
Imagine that I play Tennis every Saturday and I always invite a friend to come with me. Sometimes
my friend shows up, sometimes not. For him it depends on a variety of factors, such as: weather,
temperature, humidity, wind etc..
I start keeping track of these features and whether or not he showed up to play with me.
I want to use this data to predict whether or not he will show up to play. An intuitive way to do
this is through a Decision Tree.
Imaginary Data with 3 features (X,Y, and Z ) with two possible classes.
We can then use the rpart() function to build a decision tree model, rpart(formula, method, data), where formula specifies the model (e.g. Kyphosis ~ Age + Number + Start), method is "class" for a classification tree or "anova" for a regression tree, and data is the data frame to use.
Sample Data
We'll use the kyphosis data frame, which has 81 rows and 4 columns, representing data on children who have had corrective spinal surgery. It has the following columns:
• Kyphosis - a factor with levels absent/present indicating if a kyphosis (a type of deformation) was present after the operation
• Age - in months
• Number - the number of vertebrae involved
• Start - the number of the first (topmost) vertebra operated on
A sketch of fitting a tree on this data is shown below.
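This sketch assumes the rpart package (which also ships the kyphosis data frame) and produces the tree object used by printcp() below:

library(rpart)
tree <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis)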
1 printcp ( tree )
Tree Visualization
We would like to choose a hyperplane that maximizes the margin between classes.
The vector points that the margin lines touch are known as Support Vectors.
Example Predictions
We have a small data set, so instead of splitting it into training and testing sets (which you should always try to do!), we'll just score our model against the same data it was trained on:
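A sketch of what that scoring could look like (assuming the e1071 package and the iris data, which the tuning example below also uses):

library(e1071)
model <- svm(Species ~ ., data = iris)
predicted.values <- predict(model, iris[1:4])
table(predicted.values, iris[, 5])   # confusion table of predictions vs. true species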
Advanced - Tuning
We can try to tune parameters to attempt to improve our model; you can refer to the help() documentation to understand what each of these parameters stands for. We use the tune() function:
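A hedged sketch of such a grid search; the parameter ranges are illustrative, chosen to include the cost and gamma values mentioned below:

tune.results <- tune(svm, train.x = iris[1:4], train.y = iris[, 5],
                     kernel = "radial",
                     ranges = list(cost = c(0.1, 1, 10), gamma = c(0.1, 0.5, 1)))
summary(tune.results)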
We can now see that the best performance occurs with cost=1 and gamma=0.5. You could try to
train the model again with these specific parameters in hopes of having a better model:
1 tuned . svm <- svm ( Species ~ . , data = iris , kernel = " radial " , cost =1 ,
gamma =0.5)
2 summary ( tuned . svm )
3 tuned . predicted . values <- predict ( tuned . svm , iris [1:4])
4 table ( tuned . predicted . values , iris [ ,5])
5
6 tuned . predicted . values setosa versicolor virginica
7 setosa 50 0 0
8 versicolor 0 48 2
9 virginica 0 2 48
Looks like we weren't able to improve on our model! The concept of trying to tune parameters by just trying many combinations is generally known as a grid search. In this case, we likely have too little data to actually improve our model through careful parameter selection.
The R function fviz_nbclust() [in factoextra package] provides a convenient solution to estimate
the optimal number of clusters.
1 library ( factoextra )
2 data ( " USArrests " ) # Loading the data set
3 df <- scale ( USArrests ) # Scaling the data
4 fviz _ nbclust ( df , kmeans , method = " wss " ) + geom _ vline ( xintercept =
4 , linetype = 2)
Usually when dealing with an unsupervised learning problem, it's difficult to get a good measure of how well the model performed. For this example, we will use data from the UCI archive based on red and white wines (this is a very commonly used data set in ML).
We will then add a label to the combined data set; we'll bring this label back later to see how well we can cluster the wine into groups.
Data: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
1 df1 <- read . csv ( ’ winequality - red . csv ’ , sep = ’; ’)
2 df2 <- read . csv ( ’ winequality - white . csv ’ , sep = ’; ’)
Now add a label column to both df1 and df2 indicating a label ’red’ or ’white’.
1 # Using sapply with anon functions
2 df1 $ label <- sapply ( df1 $ pH , function ( x ) { ’ red ’ })
3 df2 $ label <- sapply ( df2 $ pH , function ( x ) { ’ white ’ })
Combine df1 and df2 into a single data frame called wine
1 wine <- rbind ( df1 , df2 )
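The clustering call itself is not shown in this excerpt; a minimal sketch consistent with the objects used below would be:

set.seed(101)
wine.cluster <- kmeans(wine[1:12], centers = 2)     # cluster on the 12 chemical columns only
table(wine$label, wine.cluster$cluster)             # compare the clusters to the true labels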
We can see that red is easier to cluster together. There seems to be a lot of noise with white
wines, this could also be due to "Rose" wines being categorized as white wine, while still retaining
the qualities of a red wine.
It’s important to note here, that K-Means can only give you the clusters, it can’t directly
tell you what the labels should be, or even how many clusters you should have, we are just
lucky to know we expected two types of wine. This is where domain knowledge really comes
into play. We can also view our results by using fviz_cluster. This provides a nice illustration of
the clusters. If there are more than two dimensions (variables) fviz_cluster will perform principal
component analysis (PCA) and plot the data points according to the first two principal components
that explain the majority of the variance.
1 library ( factoextra )
2 fviz _ cluster ( wine . cluster , data = wine [1:12])