Basic Data Science With R
WITH
Day 1
Data Science
Source: IBM
R PROGRAMMING LANGUAGE
Introduction
6
Strengths
• R is FREE!
• R runs on almost any standard computing platform
and operating system
• A key advantage of R over many other statistical packages (even today) is
its sophisticated graphics capabilities
• R is widely used and is supported by an active and
vibrant user community
7
Limitations
8
R Programming Language
10
INSTALLING R
Installing R on Windows
• Go to http://cran.r-project.org and select
Windows.
13
RStudio Screenshot
14
RStudio – Workspace tab
15
RStudio – History Tab
• The history tab keeps a record
of all previous commands. It
helps when testing and running
processes. Here you can either
save the whole list or you can
select the commands you want
and send them to an R script to
keep track of your work.
16
Rscript
20
Help with a Function I
21
Help with a Function II
22
Help with a Package I
23
Searching for Help I
24
0930 - 1000
R DATA TYPES AND
OBJECTS
Creating Variables I
> 3+9
# [1] 12
26
Creating Variables II
# Approach 1
a=10
a #or just type print(a)
[1] 10
# Approach 2
b <-10
b
[1] 10
27
Creating Variables III
Caution!
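• Be careful when using < to compare a variable with
a negative number: without a space, R reads <- as assignment!
# Assign a value to a
a <- -2
# Is a less than -5?
a <-5 # Parsed as the assignment a <- 5, not the comparison a < -5
a
[1] 5 # Expected FALSE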
28
Creating Variables IV
a <- 5
a < -2
[1] FALSE
29
Creating Variables V
Caution!
• It is important not to name your variables after
existing variables or functions. For example, a bad
habit is to name your data frames data. data is a
function used to load some datasets.
• If you give a variable the same name as an existing
constant, that constant is overwritten with the value of
the variable. So, it is possible to define a new value
for π .
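A minimal sketch of this caution, assuming a fresh R session; the value 3 is purely illustrative. Strictly speaking the assignment creates a new variable that hides the built-in constant, and removing that variable makes the built-in value visible again.

```r
# Assigning to `pi` creates a variable that hides the built-in constant
pi <- 3                    # illustrative value, not from the slides
masked_area <- pi * 2^2    # computed with 3, not 3.14159...

# Removing the variable makes the built-in constant visible again
rm(pi)
true_area <- pi * 2^2
```

This is one reason to avoid reusing names such as pi, T or F for your own variables.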
30
Creating Variables VI
Caution!
• On the other hand, if you give a variable the same name as
an existing function, R will treat the identifier as a variable if
used as a variable, and will treat it as a function when it is
used as a function:
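The code for this slide is not in the text, so the following is only a sketch of the behaviour described, using the built-in function c() as an assumed example.

```r
# `c` is bound to a number, yet calling c(...) still finds the function
c <- 5
combined <- c(c, 2)   # the function c() applied to the variable c and to 2

# Clean up so the variable does not linger in the workspace
rm(c)
```

R looks for a function when a name is followed by parentheses, so the variable and the function can coexist, although the code becomes harder to read.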
31
Creating Variables VII
Caution!
• As we have seen, you can get away with using the
same name for a variable as with an existing
function, but you will be in trouble if you give a name
to a function and a function with that name already
exists.
32
OBJECTS
Objects in R
34
Objects in R: Integer, Real Number, NaN
35
Objects in R: Hands-on (Try This)
> y <- 10
> class(y)
36
Creating Vectors I
37
Creating Vectors II
38
Creating Vectors II
39
Some Useful Vector Functions I
40
Some Useful Vector Functions II
41
Some Useful Vector Functions III
g <-c(2, 6, 7, 4, 5, 2, 9, 3, 6, 4, 3)
sort (g, decreasing = TRUE )
[1] 9 7 6 6 5 4 4 3 3 2 2
42
Some Useful Vector Functions IV
43
Some Useful Vector Functions V
44
Some Useful Vector Functions VI
Caution!
mean(a)
[1] NA
mean(a, na.rm=TRUE)
[1] 1.5
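The definition of a is not shown on this slide. The sketch below assumes a vector with one missing value, chosen so that it is consistent with the output above.

```r
# Assumed input: one missing value among the observations
a <- c(1, 2, NA)

mean_with_na <- mean(a)                    # NA: a missing value propagates
mean_without_na <- mean(a, na.rm = TRUE)   # missing values dropped first
```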
45
Some Useful Vector Functions VII
46
Some Useful Vector Functions VIII
47
Comparison in R
48
1000 - 1040
READING AND WRITING
DATA
Data from the Internet I
50
Data from the Internet II
51
Importing Data from Your Computer I
52
Using Data Available in R I
• Extract the dataset you want from that package, using the
data() function. In our case, the dataset is called airquality
> data(airquality)
53
Working with Datasets in R II
54
Working with Datasets in R III
55
Working with Datasets in R VI
• To get the mean of Ozone variable in the dataset, use
mean():
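A short sketch of this step. The Ozone column contains missing values, so a plain mean() returns NA; the na.rm = TRUE argument shown here is one way to handle that.

```r
data(airquality)

# Ozone has missing values, so count them first
n_missing <- sum(is.na(airquality$Ozone))

# Drop the missing values before averaging
ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
```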
56
Morning Break
Please be back by 11:00 am
1100 - 1200
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
59
Creating Matrices I
60
Some Useful Matrix Functions I
61
Some Useful Matrix Functions II
62
Some Useful Matrix Functions III
63
Some Useful Matrix Functions IV
64
Creating Matrices from Vectors I
65
Creating Matrices from Vectors II
66
Names for matrices
• Matrices can have both column and row names.
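A minimal sketch of naming rows and columns with dimnames(); the matrix values and the names are illustrative assumptions.

```r
# A 2 x 2 matrix, filled column by column
m <- matrix(1:4, nrow = 2, ncol = 2)

# First list element: row names; second: column names
dimnames(m) <- list(c("a", "b"), c("c", "d"))

# Names can then be used for indexing
value <- m["a", "d"]
```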
67
Data Frame
68
Names
• R objects can have names, which is very useful for writing
readable code and self-describing objects. Here is an example
of assigning names to an integer vector.
> names(x)
[1] "New York" "Seattle" "Los Angeles"
69
Names
• Lists can also have names, which is often very useful.
> x <- list("Los Angeles" = 1, Boston = 2, London =
3)
> x
$`Los Angeles`
[1] 1
$Boston
[1] 2
$London
[1] 3
> names(x)
[1] "Los Angeles" "Boston" "London"
70
1200 - 1300
SUB-SETTING DATA
Subsetting with Vectors I
72
Subsetting with Vectors II
73
Subsetting with Vectors III
74
Subsetting with Vectors IV
75
Subsetting with Vectors V
• We can use subsetting to explicitly tell R what observations we
want to use. To get all elements of d greater than or equal to 2,
d[d >= 2]
[1] 3 5 7 9
76
Day 1 Exercise 1
77
Sub-setting with Matrices I
78
Sub-setting with Matrices II
79
Sub-setting with Matrices III
80
Exercise 2
81
Lunch
Please be back by 2:00 pm
1400 - 1500
CONTROL STRUCTURES
Control Structures
84
Control Structures
• If-condition
if (<condition>) {
## do something
}
## Continue with rest of code
• If-else condition
if(<condition>) {
## do something
} else {
## do something else
}
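A runnable sketch of the if-else template; the variable x and the threshold are illustrative assumptions. Note that at the top level of a script, else has to follow the closing brace on the same line, otherwise R treats the if statement as already complete.

```r
x <- 7   # hypothetical value

if (x > 5) {
  size <- "large"
} else {
  size <- "small"
}
```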
85
For loops (Basic)
for (i in 1:10) {
print(i)
cat("Hello World\n")
}
86
For Loop
# Using seq_along()
for(i in seq_along(x)){
print(x[i])
}
# Using 1:length()
for(i in 1:length(x)){
print(x[i])
}
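The two loop forms behave the same for a non-empty vector. The sketch below, with assumed example vectors, shows where they differ: on an empty vector, 1:length(x) counts down from 1 to 0 and the loop body runs twice.

```r
x <- c("a", "b", "c")
same_index <- identical(seq_along(x), 1:length(x))   # TRUE for non-empty x

empty <- character(0)

n_safe <- 0
for (i in seq_along(empty)) n_safe <- n_safe + 1      # body never runs

n_unsafe <- 0
for (i in 1:length(empty)) n_unsafe <- n_unsafe + 1   # 1:0 is c(1, 0): runs twice
```

For this reason seq_along() is usually the safer choice when the vector might be empty.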
87
For Loop: Hands-on
• Use a for loop to list all the files in a folder (you
can choose any folder).
Steps:
1. Read all the files in the folder – list.files()
2. Use for loop to iterate through the files. Print one
file at a time.
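One possible sketch of these steps. The folder is an assumption: a temporary directory with two example files is used so that the code runs anywhere; replace it with a folder of your choice.

```r
# Assumed folder: the session's temporary directory, with two example files
folder <- tempdir()
file.create(file.path(folder, c("example_a.txt", "example_b.txt")))

# Step 1: read all the file names in the folder
files <- list.files(folder)

# Step 2: iterate through the files, printing one at a time
for (i in seq_along(files)) {
  print(files[i])
}
```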
88
For Loop: Hands-on
89
While loops
count <- 0
while(count < 10) {
print(count)
count <- count + 1
}
90
Exercise 3
• A short exercise: use loops to compute a
multiplication table and store it in a data frame or
vector, i.e. store only the answers.
[1] 1
[1] 2
[1] 3
[1] 2
[1] 4
[1] 6
[1] 3
[1] 6
[1] 9
91
Loop example (Harder)
• If we call the duplicated() function, e.g.
x <- c(3, 5, 7, 2, 9, 4, 3, 2, 8, 5, 2)
tf <- duplicated(x)
• If we want to extract the values that have duplicates,
for(i in seq_along(tf)) {
if (tf[i]) { # tf[i] is already TRUE or FALSE
print(x[i])
}
}
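The same result can be sketched without a loop, since logical vectors can index directly. This is an alternative to the loop above, not a replacement for the exercise.

```r
x <- c(3, 5, 7, 2, 9, 4, 3, 2, 8, 5, 2)

# duplicated() flags the second and later occurrences of each value
dup_values <- x[duplicated(x)]

# Each repeated value listed once
dup_unique <- unique(dup_values)
```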
92
Date and Time
93
Date and Time: Hands-on
> today <- "3/6/2015"
> class(today)
> today.date<-as.Date(today,"%d/%m/%Y")
> today.date
> class(today.date)
> unclass(today.date)
Try This!
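A sketch of what the commands above do, with the intermediate results stored in assumed helper variables. A Date is stored internally as the number of days since 1970-01-01, which is what unclass() reveals.

```r
today <- "3/6/2015"
class_before <- class(today)                # a plain character string

# %d/%m/%Y reads the string as day/month/four-digit year
today.date <- as.Date(today, "%d/%m/%Y")
class_after <- class(today.date)            # now a Date object

# Internal representation: days since 1970-01-01
days_since_epoch <- unclass(today.date)
```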
94
Date and Time: Hands-on
95
1540 - 1600
CREATING AND USING
FUNCTIONS
Functions
• Writing functions is a core activity of an R programmer. It
represents the key step of the transition from a mere “user” to a
developer who creates new functionality for R.
• Functions are defined using the function() directive and are
stored as R objects just like anything else. In particular, they are
R objects of class “function”.
f <- function() {
## This is an empty function
}
97
Functions
f <- function(num) {
for(i in seq_len(num)) {
cat("Hello, world!\n")
}
}
• Now run f(3)
98
The paste function
• The paste function concatenates two or more objects into a single character string.
• E.g.,
> paste("a", "b", sep=" ")
[1] "a b"
> paste("a", "b", sep="***")
[1] "a***b"
> paste("a", "b", "c", sep="***")
[1] "a***b***c"
99
Argument Matching
100
1600 - 1700
LABORATORY EXERCISE
BASIC DATA SCIENCE
WITH
Day 2
• This is used in R
105
Dynamic Scoping
#!/bin/bash
x=10
function f {
x=$(($1 * $1))
echo $x
}
What is printed?
f $x # prints 100
echo $x # prints 100
106
Lexical Scoping (R)
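The content of this slide is not in the text. As a hedged sketch of the contrast with the bash example, the functions below (all names are illustrative assumptions) show that an R function modifies only its own local copy, and that a free variable is looked up where the function was defined, not where it was called.

```r
x <- 10
f <- function(x) {
  x <- x * x   # changes only the local x
  x
}
result <- f(x)   # the global x is left untouched

y <- 1
g <- function() y      # y is resolved where g was defined
h <- function() {
  y <- 2               # local to h; invisible to g under lexical scoping
  g()
}
scoped_value <- h()
```

Under dynamic scoping h() would see y as 2; in R it sees the value from the defining environment.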
107
Date: the lubridate package
111
Date
• To retrieve the date of today,
> d1 <- Sys.Date()
• Date format codes:
%a – weekday
%d – day number
%b – month
%y – year
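A small sketch of how these format codes are used with format(). A fixed date is assumed here so the result is reproducible; Sys.Date() would give today's date instead. Weekday and month names depend on the locale.

```r
d1 <- as.Date("2015-06-03")   # assumed fixed date; Sys.Date() returns today

day_number <- format(d1, "%d")        # two-digit day of the month
short_year <- format(d1, "%y")        # two-digit year
label <- format(d1, "%a %d %b %y")    # weekday and month names are locale-dependent
```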
112
Date
> weekdays(z)
> months(z)
> julian(z)
113
Exercise 1
• Use the airquality dataset
• Notice that there are two variables, Month and Day
• Use paste() to combine them and store in a new
vector
• Convert it to a date format (hint: use as.Date())
114
1000 – 1040
USING THE R “APPLY”
FUNCTIONS
Looping Functions
116
lapply()
117
lapply()
## Output
$a
[1] 3
$b
[1] 0.1322028
118
lapply()
119
sapply()
121
apply()
122
apply()
123
apply()
124
tapply()
• tapply applies a function to subsets of a vector defined by a
factor, so it is well suited to grouped summaries.
• E.g,
> name<-c("Tan","Tan","Tan","Lee","Lee","Lee")
> subject<-c("IT","CS","AI","IT","CS","AI")
> marks<-c(90,95,80,90,99,85)
> df<-data.frame(name,subject,marks)
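Continuing with the data frame above, a sketch of a grouped summary: the mean mark per student. Choosing mean as the function is an assumption; any summary function can be passed.

```r
name <- c("Tan", "Tan", "Tan", "Lee", "Lee", "Lee")
subject <- c("IT", "CS", "AI", "IT", "CS", "AI")
marks <- c(90, 95, 80, 90, 99, 85)
df <- data.frame(name, subject, marks)

# Split marks by name and apply mean() to each group
mean_by_name <- tapply(df$marks, df$name, mean)
```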
125
tapply()
126
Short Exercise
127
Looping …
128
Try some of the following commands
129
• Load the dataset named “mtcars”.
> library(datasets)
> data(mtcars)
> ?mtcars
130
Try some of the following commands
131
Morning Break
Please be back by 11:00 am
1100 – 1200 PM
RESHAPING DATA, SUB-SETTING
OBSERVATIONS & VARIABLES,
SUMMARIZING DATA.
Sub-setting Data
> set.seed(1)
> x <- data.frame("var1"=sample(1:5),
"var2"=sample(6:10),"var3"=sample(11:15))
> x$var2[c(1,3)]=NA
> x
134
Subsetting Data: Hands-on
> x[,1]
> x[1:2, "var2"]
> x[x$var1<=3 & x$var3 >10,]
> x[x$var1>2 | x$var3 >10,]
135
Subsetting Data: Hands-on
> x[x$var2>1,]
Data for var1 and var3 not presented correctly for rows with NA
136
Subsetting Data: Hands-on
> x[which(x$var2>1),]
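A sketch of why which() helps, rebuilding the same data frame. A logical index containing NA yields rows filled with NA, whereas which() returns only the positions that are TRUE.

```r
set.seed(1)
x <- data.frame("var1" = sample(1:5),
                "var2" = sample(6:10), "var3" = sample(11:15))
x$var2[c(1, 3)] <- NA

# The two NA comparisons produce two all-NA rows
with_na_rows <- x[x$var2 > 1, ]

# which() keeps only the positions where the comparison is TRUE
clean_rows <- x[which(x$var2 > 1), ]
```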
137
Subsetting Data: Hands-on
138
Sorting Data
139
Sorting Data
140
Sorting Data: Hands-on
> x[order(x$var1),]
141
Sorting Data: plyr package
> library(plyr)
> arrange(x, var1)
> arrange(x, desc(var1))
142
Quantile
143
Merging data
> set.seed(2)
> x<- sample(1:20,10)
> y<- sample(30:50,10)
> dt.2 <- data.frame(x,y)
144
Merging data
146
Merging Data: Hands-on
Task 1:
Return only the rows in df1 that have
matching keys in df2.
147
Merging Data: Hands-on
Task 2:
Return all rows from df1, and any rows with
matching keys from df2.
148
Merging Data: Hands-on
Task 3:
Return all rows from df2, and any rows with
matching keys from df1.
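One way these three tasks could be approached with merge(). The data frames df1 and df2, their columns and the key name id are illustrative assumptions, since the originals are not shown.

```r
# Hypothetical data frames sharing the key column "id"
df1 <- data.frame(id = c(1, 2, 3), score = c(90, 85, 70))
df2 <- data.frame(id = c(2, 3, 4), city = c("KL", "Penang", "Johor"))

# Task 1: only rows with matching keys in both
inner <- merge(df1, df2, by = "id")

# Task 2: all rows of df1, plus matches from df2
left <- merge(df1, df2, by = "id", all.x = TRUE)

# Task 3: all rows of df2, plus matches from df1
right <- merge(df1, df2, by = "id", all.y = TRUE)
```

Rows without a match are filled with NA in the columns that come from the other data frame.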
149
LABORATORY EXERCISE –
GETTING, FORMATTING AND
STORING DATA
Downloading a File
• Go to
http://www.data.gov.my/data/dataset/b5cd948f-cffb-4439-ae08-e508ff073a93/resource/1f2d5629-ac8d-449a-a4c1-9d269d625d84/download/lokalitihotspot2015.xlsx
and download dengue hotspot for 2015
151
Downloading a File
> url<-"http://www.data.gov.my/data/dataset/b5cd948f-cffb-4439-ae08-e508ff073a93/resource/1f2d5629-ac8d-449a-a4c1-9d269d625d84/download/lokalitihotspot2015.xlsx"
> download.file(url,"dengue.xlsx",mode='wb')
152
Downloading a File
153
Column Name Manipulation
• Remove dot(.) in all the field names
> names(dt) <- gsub(".", "", names(dt), fixed=TRUE)
Note: fixed=TRUE makes sure gsub treats "." as a literal dot, not as a
regular-expression wildcard
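A sketch of the difference fixed=TRUE makes, using two of the field names from the dengue data as a stand-alone vector (the vector name is an assumption).

```r
field_names <- c("Jumlah.Kes.Terkumpul", "Daerah.Zon.PBT")

# fixed = TRUE: "." is a literal dot, so only the dots are removed
literal <- gsub(".", "", field_names, fixed = TRUE)

# Default: "." is a regular expression matching any character,
# so every character is removed
regex <- gsub(".", "", field_names)
```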
154
Column Name Manipulation
Check the
changes to
the columns
155
Column Name Manipulation
156
Records Manipulation
• Inspect the data for Location; you can see there are many
wrong spellings and abbreviations (e.g. “Tmn” for “Taman”).
For example, to correct “Lndah” to “Indah” for all data in Location:
> dt$Location <- gsub("Lndah", "Indah", dt$Location)
Check the data first before replacing anything.
Try replacing all “Kg” with “Kampung”.
157
String split
158
Finding Values
> grep("Taman",dt$Location)
> grep("Taman",dt$Location, value=TRUE)
How many of them? What command to use?
159
Finding Values
> cnt<-table(grepl("Taman",dt$Location))
> barplot(cnt)
What do you get? Explain…
160
Finding Values
• Now, store all the data with the word “Taman” in Location in a
variable named dt_Taman.
161
Finding Values
• Now, store all the data with the word “Seksyen” or “Medan”
in Location
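A sketch of both subsetting tasks with grepl(). The small data frame below is a hypothetical stand-in for the dengue data, and the name dt_SeksyenMedan is an assumption.

```r
# Hypothetical stand-in for the dengue data
dt <- data.frame(
  Location = c("Taman Indah", "Seksyen 7", "Medan Selera", "Kampung Baru"),
  Cases = c(12, 5, 8, 3)
)

# Rows whose Location contains "Taman"
dt_Taman <- dt[grepl("Taman", dt$Location), ]

# "|" in a regular expression means "or"
dt_SeksyenMedan <- dt[grepl("Seksyen|Medan", dt$Location), ]
```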
162
Finding Values
163
Exercise 2 (re-read the dengue csv file)
• What is the command to split data according to
“Negeri”?
• What is the command to split “Jumlah.Kes.Terkumpul”
according to “Negeri”?
• Apply a function across elements of the list
• How to get the sum of Jumlah.Kes.Terkumpul for each
Daerah.Zon.PBT?
164
Homework
[estimation = 20 minutes]
165
Lunch Break
Please be back by 2:00 pm
1500 – 1540
GETTING DATA FROM
DIFFERENT SOURCES
Downloading files
if (!file.exists("data")){
dir.create("data")
}
168
Read MySQL
169
UCSC MySQL
170
Demo (connect and read tables)
171
Basic RMySQL commands
• ucscDb<-dbConnect(MySQL(), user="genome",
host="genome-mysql.cse.ucsc.edu") #to connect to MySQL with username and password
• dbGetQuery #execute and fetch SQL queries
• dbListTables #list all tables
• dbListFields #list all fields for table
• dbReadTable #read content from table
• dbSendQuery #execute SQL queries
• dbDisconnect #disconnect connection
172
Read from the Web
173
Example Webpage
174
Read from the Web
> con= url("http://pesona.mmu.edu.my/~ccho")
> htmlCode=readLines(con)
> close(con)
> htmlCode
175
Exercise 3
176
Further Resources – Read from the Web
177
Downloading files
fileURL <-
"http://www.google.com/finance/historical?q=NASDAQ%3AGOOGL&ei=hpYFWJuSIZGHuwSKg43wDA&output=csv"
download.file(fileURL,
destfile="alphabet.csv")
list.files()
178
Exercise 4
179
Read local files (Revision)
180
Exercise
181
read.csv () and read.csv2 ()
• read.csv and read.csv2 are identical to read.table except for the
defaults.
• They are intended for reading ‘comma separated value’ files (‘.csv’) or
(read.csv2) the variant used in countries that use a comma as decimal
point and a semicolon as field separator.
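A sketch of the difference, reading the same numbers from two in-memory strings instead of files (the strings and variable names are illustrative assumptions).

```r
# Same data written in the two conventions
comma_csv <- "x,y\n1.5,2.5\n3.5,4.5"        # comma separator, dot decimal
semicolon_csv <- "x;y\n1,5;2,5\n3,5;4,5"    # semicolon separator, comma decimal

a <- read.csv(text = comma_csv)
b <- read.csv2(text = semicolon_csv)
```

Both calls yield the same numeric columns; reading the second string with read.csv would instead give a single garbled column.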
182
Exercise 5
183
Reading Excel Files
• library (xlsx) package
rowIndex - a numeric vector indicating the rows you want to extract. If NULL,
all rows found will be extracted, unless startRow or endRow are specified.
colIndex - a numeric vector indicating the cols you want to extract. If NULL,
all columns found will be extracted.
• read.xlsx2 is much faster but can be unstable
• In general, it is advised to store data in either .csv or .txt format, as
these are easier to distribute
184
Exercise 6
• Download the Excel spreadsheet on Natural Gas Acquisition
Program here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx
185
Reading from XML
• http://en.wikipedia.org/wiki/XML
186
Tags, elements and attributes
187
Example XML file
188
Read the XML file into R
189
Directly access parts of the XML
document
> rootNode[[1]]
> rootNode[[1]][[1]]
190
Programmatically extract parts of the file
> xmlSApply(rootNode,xmlValue)
191
XPath
192
Get the items on the menu and prices
193
Reading JSON
194
Example of JSON file
195
Afternoon Break
Please be back by 4:00 pm
1600 - 1700
LABORATORY
EXERCISE
BASIC DATA SCIENCE
WITH
Day 3
201
Principle of Analytic Graphics
202
Plotting systems in R
203
{base} graphics system
• Common main plotting functions.
• hist(), barplot(), boxplot(), plot()
204
Demo of {base} plotting system
•Annotation using
• points(), lines(), text()
205
Some Important Base Graphics
Parameters
• Many base plotting functions share a set of parameters. Here
are a few key ones:
• pch: the plotting symbol (default is open circle)
• lty: the line type (default is solid line), can be dashed,
dotted, etc.
• lwd: the line width, specified as an integer multiple
• col: the plotting color, specified as a number, string, or hex
code; the colors() function gives you a vector of colors by
name
• xlab: character string for the x-axis label
• ylab: character string for the y-axis label
206
col and pch
R colour chart:
http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.pdf
207
Line types
208
Some Important Base Graphics
Parameters
• The par() function is used to specify global graphics
parameters that affect all plots in an R session. These
parameters can be overridden when specified as arguments
to specific plotting functions.
• las : the orientation of the axis labels on the plot
• bg : the background color
• mar : the margin size
• oma : the outer margin size (default is 0 for all sides)
• mfrow : number of plots per row, column (plots are filled
row-wise)
• mfcol : number of plots per row, column (plots are filled
column-wise)
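A sketch of setting global parameters with par() and restoring them afterwards. The choice of plots and margin values is an illustrative assumption.

```r
# par() returns the previous settings, which makes them easy to restore
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

hist(mtcars$mpg, main = "mpg")                    # left panel
plot(mtcars$hp, mtcars$mpg, main = "mpg vs hp")   # right panel

current_layout <- par("mfrow")

# Restore the earlier settings so later plots are unaffected
par(old_par)
```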
209
Plots
library(datasets)
hist(warpbreaks$breaks,
breaks=20,
xlab = "Breaks",
main="Number of Breaks in Yarn during Weaving",
ylim = c(0,20))
(Figure labels: main, breaks, xlab, ylab, xlim, ylim)
211
Basic Line Plot
library("MASS")
data("cats")
plot(cats$Bwt, cats$Hwt,
type="l",
col="red",
lwd=1,
ylab="Heart weight (g)",
xlab="Body weight (Kg)",
main="Anatomical features of house cats")
fit1<-lm(formula= cats$Hwt ~ cats$Bwt)
abline(fit1, lty="dashed")
#sample of text to be placed in plot
text(x=2.3, y=18, labels="R2=0.896\n P=2615e-15")
(Figure labels: text(), abline(fit1), type, lwd, col)
212
Basic Scatterplot
library(datasets)
plot(iris$Sepal.Length, iris$Petal.Length,
col=iris$Species,
pch=16,
cex=0.5,
xlab="Sepal Length",
ylab="Petal Length",
main="Flower Characteristics in Iris")
legend(x=4.2, y=7, legend=levels(iris$Species), col=c(1:3), pch=16)
(Figure labels: legend(), col, pch, cex)
213
Basic Boxplot
library(datasets)
boxplot(iris$Sepal.Length ~ iris$Species,
ylab="Sepal Length",
xlab="Species",
main="Sepal Length by Species in Iris")
214
{base} graphics system – par()
• Panel plotting
• 1 by 2
• 2 by 2
• 3 by 1
• par()
• mfrow
e.g. par(mfrow=c(2,2))
215
Par () margins
e.g. par(mar=c(3,4,3,4), oma=c(3,4,3,4))
216
Exercise 1
20 minutes
217
Exercise 2a (plot disp ~ mpg)
10 minutes
218
Exercise 2b (plot disp ~ mpg)
40 minutes
• Main title
• Plotting symbol type, size
and colour
• X and y axes labels
• X and y axes limits
• Include an abline() that is
dashed, red and thicker
219
{base} graphics system – output
220
{base} graphics system - output
221
Graphics Device
• Computer Screen •NOT
• Input (such as Mouse &
Keyboard)
• File System
• Network Connection
Bitmap vs Vector Graphics
• Bitmap
• BMP
• JPG / JPEG
• TIFF
• GIF
• PNG
• Better for point kind of plots such as scatter plots and density plots.
• Vector
• PS
• EPS
• SVG
• Good for resizing, scaled plots.
Morning Break
Please be back by 11:00 am
1100 - 1200
INTRODUCTION TO LATTICE
PLOT
Lattice plot
226
Lattice plot format
• graph_type(formula, data=)
# Lattice Examples
library(lattice)
attach(mtcars)
228
Try it yourself
229
Try it yourself
# kernel density plots by factor level (alternate layout)
densityplot(~mpg|cyl.f,
main="Density Plot by Number of Cylinders",
xlab="Miles per Gallon",
layout=c(1,3))
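The factor cyl.f is not defined in the text shown here. The sketch below assumes it is the cyl column of mtcars converted to a labelled factor, and passes the data explicitly instead of relying on attach().

```r
library(lattice)

# Assumed definition: cylinders as a labelled factor
cyl.f <- factor(mtcars$cyl, levels = c(4, 6, 8),
                labels = c("4cyl", "6cyl", "8cyl"))

# One density panel per factor level, stacked in 1 column and 3 rows
p <- densityplot(~ mpg | cyl.f, data = mtcars,
                 main = "Density Plot by Number of Cylinders",
                 xlab = "Miles per Gallon",
                 layout = c(1, 3))
print(p)
```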
230
Customizing Lattice plots
231
Customizing Lattice plots
232
Lunch Break
Please be back by 2:00 pm
1400 - 1500
INTRODUCTION TO GG PLOT 2
{ggplot2} graphics system
• Created by Hadley Wickham
• Based on Leland Wilkinson's
The Grammar of Graphics.
• Widely used and often
considered the best R
package for static
visualization.
• Package = ggplot2.
• The main function is
ggplot().
{ggplot2} introduction
install.packages("ggplot2")
library(ggplot2)
head(mtcars)
236
{ggplot2} plot layers
library(ggplot2)
g <- ggplot(data = mtcars, aes(x = hp, y
= mpg))
print(g)
237
{ggplot2} Scatterplots
ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point()
238
Exercise 3: Adding colours to the plot
Modify the content of the aes() to include colours and produce the
following graph.
ggplot(data = mtcars, aes(x = hp, y = mpg)) +
geom_point()
239
Exercise 4: Adding legend label to the
plot
Include the
appropriate
parameters
here (hint:
vectors)
+ scale_color_discrete(labels = )
240
Exercise 5: Re-labelling
labs(color = "Transmission", ? ? )
241
{ggplot2} Scatterplots
• Add different themes to
your plots
• theme_bw()
• theme_light()
• theme_minimal()
• theme_classic()
> head(diamonds)
244
Exercise 6a: {ggplot2} Bar Charts
• Using the “diamonds”
data and geom_bar()
geom_line()
248
{ggplot2} Line Charts
set.seed(1)
time <- 1:10
income <-
runif(10,1000,5000)
dt <-
data.frame(time,income)
249
{ggplot2} Line Charts
• Using the dataset created
(previous slide)
set.seed(1)
time<-rep(1:5, each=2)
income <- runif(10,1000,5000)
type <- rep(c("A","B"), times=5) # assumed grouping variable; its original definition is not shown
dt <- data.frame(time,income,type)
ggplot(dt, aes(x = time, y = income, colour = type)) +
geom_smooth()
253
1500 - 1540
GGPLOT2 EXERCISE
Exercise 7: data("msleep")
255
Afternoon Break
Please be back by 4:00 pm
1600 - 1700
LABORATORY EXERCISE –
COLOURS AND PLOTS
Plotting and Colour in R
x <- rnorm(10000)
y <- rnorm(10000)
plot(x,y)
smoothScatter(x,y)
Experiment with Colours
hist(mtcars$mpg, col.lab="red")
hist(mtcars$mpg, col.lab=552)
Experiment with Colours
Experiment with Colours
colors()[c(552,254,26)]
[1] "red" "green" "blue"
grep("red",colors())
[1] 100 372 373 374 375 376 476 503 504 505 506 507 524 525 526
[16] 527 528 552 553 554 555 556 641 642 643 644 645
Experiment with Colours
colors()[grep("red",colors())]
[1] "darkred" "indianred" "indianred1" "indianred2"
[5] "indianred3" "indianred4" "mediumvioletred" "orangered"
[9] "orangered1" "orangered2" "orangered3" "orangered4"
[13] "palevioletred" "palevioletred1" "palevioletred2" "palevioletred3"
[17] "palevioletred4" "red" "red1" "red2"
[21] "red3" "red4" "violetred" "violetred1"
[25] "violetred2" "violetred3" "violetred4"
Experiment with Colours
colors()[grep("sky",colors())]
[1] "deepskyblue" "deepskyblue1" "deepskyblue2" "deepskyblue3"
[5] "deepskyblue4" "lightskyblue" "lightskyblue1" "lightskyblue2"
[9] "lightskyblue3" "lightskyblue4" "skyblue" "skyblue1"
[13] "skyblue2" "skyblue3" "skyblue4"
Experiment with Colours
with(airquality, plot(Wind,Ozone, type="n"))
with(subset(airquality,Month==5), points(Wind, Ozone, col="blue"))
with(subset(airquality,Month!=5), points(Wind, Ozone, col="red"))
legend("topright",pch=1,col=c("blue","red"),legend = c("may","other months"))
Experiment with Colours
x <- rnorm(10000)
hist(x,col="blue")
mycolor<-c("blue","red","green")
hist(x,col=mycolor)
Experiment with Colours
•Define your own colour
library(RColorBrewer)
display.brewer.all()
• RColorBrewer works a
little differently from how
we’ve defined palettes
previously. We’ll have to
use brewer.pal to create
a palette.
Experiment with Colours
library(RColorBrewer)
darkcols <- brewer.pal(8, "Dark2")
hist(x, col = darkcols)
Exercise 7b: data("msleep")
274
BASIC DATA SCIENCE
WITH
Day 4
278
Markdown Syntax
• Headings
# This is Heading 1
## This is Heading 2
### This is Heading 3
• Italics
*This is italic*
• Bold
**This is Bold**
279
Markdown Syntax
Unordered Lists
- First item
- Second item
- Third item
280
Markdown Syntax
Ordered Lists
1. First item
2. Second item
3. Third item
281
Markdown Syntax
282
Markdown Syntax
•Newlines
• End a line with two spaces to create a
new line.
283
R Markdown
284
Knitr : Strengths vs Weaknesses
Strengths | Weaknesses
Text and code all in one place | Difficult to read as code and text jumbled up
Results updated automatically | Processing of code, particularly loading of large files, can take time
Code is live! | No code, no document
Good for:
• Manuals
• Short-to-Medium Documents
• Tutorials
• Reports
• Data processing summaries
Not good for:
• Long research articles
• Documenting complex processes
• Documents with precise formatting
285
R Markdown: Hands-on
286
R Markdown: Hands-on
288
If you don’t like clicking on buttons…
library(knitr)
knit2html("document.Rmd")
browseURL("document.html")
289
knitr notes
• Code chunks begin with ```{r} and end with ```
• All R code goes in between these markers
• Code chunks can have names, which is useful when we start making graphics
```{r firstchunk}
## R code goes here
```
• By default, code in a code chunk is echoed, as are the results of the
computation (if there are results to print)
• Three things happen when you knit
• You write the RMarkdown document (with a .Rmd extension)
• knitr produces a Markdown document (with a .md extension)
• knitr converts the Markdown document into HTML (by default)
• You should NOT edit (or even save) the .md or .html documents until you are
finished; these documents will be overwritten the next time you knit the .Rmd
file.
290
R Markdown: Hands-on
```{r}
library(datasets)
library(ggplot2)
str(mtcars)
```
```{r}
ggplot(mtcars, aes(wt, mpg)) +
geom_point(aes(colour=factor(cyl), size
= qsec))
```
291
R Markdown: Hands-on
293
R Markdown: Hands-on
```{r echo=FALSE}
ggplot(mtcars, aes(wt, mpg)) +
geom_point(aes(colour=factor(cyl), size
= qsec))
```
294
Converting R Markdown to a
presentation
---
title: "Habits"
author: "John Doe"
date: "March 22, 2005"
output: beamer_presentation
# OR output: ioslides_presentation
---
297
Converting R Markdown to a
presentation
• Or you can choose the correct drop down option
from R Studio
298
RPubs
• You may submit your output to RPubs, a free R Markdown hosting
website by RStudio.
• Navigate to www.rpubs.com
299
Exercise
300
Morning Break
Please be back by 11:00 am
Introduction to the Caret package
What is Caret?
• useful set of front-end tools /
wrapper
• caret.r-forge.r-project.org
• Functionality
• Some preprocessing (cleaning):
preProcess
• For data splitting:
createDataPartition, createResample,
createTimeSlices
• Some training/testing functions: train,
predict
• Tool for model comparison:
confusionMatrix
302
ML techniques supported by R
306
Data Slicing
307
Training Options
• The default setting for training any model can be seen
by using the
args(train.default) command, including those for the
error/performance metrics
• Metric options
• Continuous outcomes: RMSE=root mean squared error,
Rsquared = R^2 from regression models
• Categorical outcomes: Accuracy = fraction correct, Kappa =
measure of concordance
• Default will follow from whether variable is a factor or not
308
Training options for Resampling
309
Plotting Predictors
Plot | Usage
featurePlot (caret) | Shows scatter plot of predictors by classes, to show correlation
qplot (ggplot2) | Shows scatter plot of predictors by classes (using colour), to show correlation. Possible to add regression smoothers (to show regression). Also able to plot density plot
cut2 (Hmisc) | Convert numeric variables into intervals (factors). Also can plot boxplot (show distribution of data)
table | Tabular results of categorical data
310
Notes on plotting predictors
311
Preprocessing
312
How to standardize (preProcess function)
314
How to handle missing values
315
On pre-processing
316
Covariate creation
317
Level 1
• Level 1, Raw data to covariates
• Depends heavily on application
• The balancing act is summarization vs info loss
• Examples:
• text files: frequency of words, phrases (Google ngrams), capital letters
• images: edges, corners, blobs, ridges (computer vision feature detection)
• webpages: number and type of images, position of elements, colors,
videos (A/B testing, aka randomized trials in statistics)
• people: height, weight, hair color, sex, country of origin
• The more knowledge of the system you have, the better the job
you will do
• When in doubt, err on the side of more features
• It can be automated, but use caution! May be important for
training but won't generalize well for test
318
Level 2
319
Additional steps for covariates
320
Level 1 and Level 2 Covariates
discussion
• Level 1 feature creation (raw to covariates):
• Science is key; Google 'feature extraction for [data type]' eg images,
voice, etc.
• Err on overcreation of features
• In some applications (images, voices), automated feature creation is
possible / necessary
• Level 2 feature creation (covariates to new cov)
• Function preProcess in caret will handle some preprocessing
• Create new covariates if you think they will improve fit
• Use exploratory analysis on the training set for creating them
• Be careful about overfitting!
• Preprocessing with caret, see instructions
321
Preprocessing with PCA (principal components analysis)
322
What of multivariate variables?
• Related problems:
• Multivariate variables X1, ... Xn so X1 = (X11, ... X1m)
• Find a new set of multivariate vars that are uncorrelated and
explain as much variance as possible (in example, use X and
throw out Y and original vars)
• If you put all the vars together in one matrix, find the best
matrix created with fewer variables (lower rank) that explains
the original data
• The first goal is statistical, and the 2nd is data compression
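A minimal sketch of the statistical goal using prcomp() on simulated data. The two correlated variables are an assumption made only for illustration: the new variables (principal component scores) are uncorrelated, and the first one carries most of the variance.

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)   # strongly correlated with x1
X <- cbind(x1, x2)

# Centre and scale, then rotate onto the principal components
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# The component scores are uncorrelated with each other
score_cor <- cor(pc$x[, 1], pc$x[, 2])

# Share of the total variance explained by each component
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
```

Keeping only the first component here is a small instance of the compression goal: one variable stands in for two.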
323
PCA/SVD
324
Principal components in R: prcomp
325
Final ideas on PCA
326
Prediction with regression
•Key ideas
•Fit a simple regression model
•Plug in new covariates and multiply by the
coefficients
•Useful when the linear model is nearly correct
•Pros: easy to implement and interpret
•Cons: often poor performance in nonlinear
settings
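A sketch of these key ideas with the built-in faithful dataset; the dataset and the new waiting time of 80 are illustrative assumptions. It fits a simple regression and shows that predict() does the same thing as multiplying the new covariate by the coefficients.

```r
# Fit a simple regression model
fit <- lm(eruptions ~ waiting, data = faithful)

# Plug in a new covariate value and multiply by the coefficients
new_waiting <- 80
by_hand <- coef(fit)[1] + coef(fit)[2] * new_waiting

# The same prediction through predict()
by_predict <- predict(fit, newdata = data.frame(waiting = new_waiting))
```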
327
Regression in R
332
Shiny showcase
https://www.rstudio.com/products/shiny/shiny-user-showcase/
333
The architecture of Shiny
• Every Shiny app is maintained by a computer running
R
Server Instructions
User Interface (UI)
334
shinyapps.io
devtools::install_github("rstudio/shinyapps")
337
Create a shinyapps.io account
338
Configure the rsconnect package
339
App template
library(shiny)
ui <- fluidPage()
ui <- fluidPage(
# *Input() functions,
# *Output() functions
)
341
Input
342
Create an Input with an Input() function
sliderInput(inputId = "num",
label = "Choose a number",
value=10, min = 1, max = 50)
343
Input syntax
sliderInput(inputId = "num", label = "Choose a
number", …)
inputId: input name (for internal usage)
label: label to be displayed
…: specific arguments
344
Output
345
Output syntax
plotOutput("hist")
346
Server Function
output$hist
plotOutput("hist")
348
Server function
})
}
349
render() function
• Use the render*() function that creates the type of
output you wish to make.
renderPlot({ hist(rnorm(100)) })
350
Server function
})
}
351
Server function
})
}
sliderInput(inputId = "num",…)
input$num
352
Input values
•The input value changes whenever a user
changes the input.
input$num = 10
input$num = 20
input$num = 35
353
Reactivity in R as a two-step process
1. Reactive values notify the functions that use them when they become invalid
input$num
2. The objects created by the reactive functions respond
renderPlot({ hist(rnorm(input$num)) })
354
Running and Deploying apps
356
Exercise 2: Customize your shiny page
library(shiny)
ui <- fluidPage(
titlePanel("title panel"),
sidebarLayout(
sidebarPanel("sidebar panel"),
mainPanel("main panel",
h1("First level title"))
)
)
357
Exercise 3: Add Input and Output
library(shiny)
ui <- fluidPage(
titlePanel("Page with Slider Input Demo"),
sidebarLayout(
sidebarPanel(
sliderInput(inputId = "num",label = "Choose a number",
value = 10, min = 1, max = 50)
),
mainPanel("main panel",
plotOutput("hist")
)
))
server <- function(input, output) {}
shinyApp(ui = ui, server = server)
Note: *Output() adds a space in the ui for an R object. You must build the
object in the server function.
358
Exercise 4: Write the server function
library(shiny)
ui <- fluidPage(
titlePanel("Page with Slider Input Demo"),
sidebarLayout(
sidebarPanel(
sliderInput(inputId = "num",label =
"Choose a number", value = 10, min =
1, max = 50)),
mainPanel("main panel",
plotOutput("hist")
)
))
server <- function(input, output) {
output$hist <- renderPlot({
hist(rnorm(input$num))
})
}
shinyApp(ui = ui, server = server)
359
The two files app: ui.R and server.R
Single-file app:
library(shiny)
[ui definition, ending with plotOutput("hist")]
server <- function(input, output) {
output$hist <- renderPlot({hist(rnorm(input$num))})
}
shinyApp(ui = ui, server = server)
Two-file app:
# ui.R
library(shiny)
[ui definition, ending with plotOutput("hist")]
# server.R
library(shiny)
server <- function(input, output) {
output$hist <- renderPlot({hist(rnorm(input$num))})
}
361
Exercise 5: Layout
library(shiny)
ui <- fluidPage(
titlePanel("Page with Multi-tab Panel"),
sidebarLayout(
sidebarPanel(sliderInput(inputId = "num",label
= "Choose a number", value = 10, min = 1, max =
50)
),
mainPanel(
tabsetPanel(
tabPanel("Plot", plotOutput("hist")),
tabPanel("Summary",
verbatimTextOutput("summary"))
)
)))
server <- function(input, output) {
output$hist <-
renderPlot({hist(rnorm(input$num))})
output$summary<-
renderPrint({summary(rnorm(input$num))})
}
shinyApp(ui = ui, server = server)
362
Afternoon Break
Please be back by 4:00 pm
1600-1700
FINAL EXERCISE AND
WRAP-UP