Basic Data Science With R


BASIC DATA SCIENCE

WITH R

Day 1

Universiti Sains Malaysia October 2016


Training Schedule: Day 1
DAY1
0900-1000 Introduction to R data types and objects
1000-1040 Reading and writing data
1040-1100 Morning Break
1100-1200 List, Matrix and Data Frame
1200-1300 Sub-setting data
1300-1400 Lunch
1400-1500 Control structures
1500-1540 Creating and using functions
1540-1600 Afternoon Break
1600-1700 Laboratory exercise
DATA SCIENCE
“Data science is the study of where information comes from, what
it represents and how it can be turned into a valuable resource in
the creation of business and IT strategies.”
Source: Wikipedia

Data Science

Source: IBM
R PROGRAMMING LANGUAGE
Introduction

• R is a programming language and software
environment for statistical computing and graphics.
• R runs on almost any standard computing platform
and operating system.
• R is an implementation of the S programming
language combined with lexical scoping semantics
• R is freely available under the GNU General Public
License.

6
Strengths

• R is FREE!
• R runs on almost any standard computing platform
and operating system
• A key advantage R has over many other statistical packages
(even today) is its sophisticated graphics capabilities
• R is widely used and is supported by an active and
vibrant user community

7
Limitations

• R is essentially based on almost 50-year-old technology
• All objects in R must generally be stored in
physical memory
• The capabilities of the R system are generally
dependent on the interests of the R user
community

R Programming Language

“The big mover is R, a statistical computing language that’s handy for
analyzing and visualizing big data, which comes in at sixth place. Last
year it was in ninth place, and its move reflects the growing importance of
big data to a number of fields.”
Source: The 2015 Top Ten Programming Languages
General tips to learn R Programming
• Keep tabs on the R community
• Read R bloggers, a blog aggregator that reposts R
related articles from across the web. R bloggers is a good
place to find R tutorials, announcements, and other random
happenings.

• Deepen your expertise
• To attain the ultimate R expertise, read Hadley
Wickham’s Advanced R book, which is
available for free online at the link. Hadley explains in
clear prose how R works from a computer science
perspective.
http://www.rstudio.com/resources/training/online-learning/#R

INSTALLING R
Installing R on Windows
• Go to http://cran.r-project.org and select Windows.
• Select base to install the R system.
• Click on the large download link. There is other information
available on the page.
• Install and open. The R window should look like this:
RStudio
• RStudio is an integrated
development environment (IDE) for
R. It includes a console, syntax-
highlighting editor that supports
direct code execution, as well as
tools for plotting, history,
debugging and workspace
management

13
RStudio Screenshot

14
RStudio – Workspace tab

• The workspace tab stores any object, value, function
or anything else you create during your R session. In the
example below, you can see the data on a screen to
the left.

RStudio – History Tab
• The history tab keeps a record
of all previous commands. It
helps when testing and running
processes. Here you can either
save the whole list or you can
select the commands you want
and send them to an R script to
keep track of your work.

• In this example, we select all and click on the “To Source”
icon; a window on the left will open with the list of commands.
Make sure to save the ‘untitled1’ file as an *.R script.

Rscript

The usual RStudio screen has four windows:
1. Console.
2. Workspace and history.
3. Files, plots, packages and help.
4. The R script(s) and data view.
The R script is where you keep a record of your work.
Rscript
• To create a new R script you can either go to File -> New
-> R Script, or click on the icon with the “+” sign and select
“R Script”, or simply press Ctrl+Shift+N. Make sure to save
the script.

Here you can type R commands and run
them. Just leave the cursor anywhere on the
line where the command is and press Ctrl+Enter
(or Ctrl-R, depending on your RStudio version)
or click on the ‘Run’ icon above. Output will
appear in the console below.
SETUP AND GETTING HELP
Setting up the Working Directory

• Always check your current working directory. You need that to
answer the assignments.
• Check your current working directory:
> getwd()

• Set your current working directory. E.g., Ian’s working directory is
"C:/Users/Ian/Desktop"
> setwd("C:/Users/Ian/Desktop")

Help with a Function I

• To get help with a function in R, use ? followed by


the name of the function.
> ?read.table
• help(function name) also works.

21
Help with a Function II

• Use the help tool at the right panel of RStudio

22
Help with a Package I

• To get help with a package, use help(package="name").

> help (package="MASS")

23
Searching for Help I

• To search R packages for help with a topic, use help.search().

> help.search ("regression")

24
0930 - 1000
R DATA TYPES AND
OBJECTS
Creating Variables I

• To use R as a calculator, type an equation and hit


ENTER. (Note how R prints the result.) Your output
should look like this:

> 3+9
# [1] 12

26
Creating Variables II

• To create variables in R, use either <- or =

# Approach 1
a=10
a #or just type print(a)
[1] 10
# Approach 2
b <-10
b
[1] 10

27
Creating Variables III

Caution!
• Be careful when comparing a variable with a negative
number: without a space, a<-5 is read as the assignment a <- 5.

# Assign a value to a
a <- -2
# Is a less than -5?
a <-5
a
[1] 5 # Expected FALSE

28
Creating Variables IV

• Use spaces so that R will not be confused.

a <- 5
a < -2
[1] FALSE

29
Creating Variables V

Caution!
• It is important not to name your variables after
existing variables or functions. For example, a bad
habit is to name your data frames data. data is a
function used to load some datasets.
• If you give a variable the same name as an existing
constant, that constant is overwritten with the value of
the variable. So, it is possible to define a new value
for π .

30
Creating Variables VI

Caution!
• On the other hand, if you give a variable the same name as
an existing function, R will treat the identifier as a variable if
used as a variable, and will treat it as a function when it is
used as a function:

c <- 2 #typing c yields "2"


c(c,c) #yields a vector containing two 2s.

31
Creating Variables VII

Caution!
• As we have seen, you can get away with using the
same name for a variable as with an existing
function, but you will be in trouble if you give a name
to a function and a function with that name already
exists.

32
OBJECTS
Objects in R

Five (5) basic classes of objects:


• Character
• Numeric (real numbers)
• Integer
• Complex
• Logical (True/False)

How do I check which class an object belongs to? Answer is the class() command.

Objects in R: Integer, Real Number, NaN

• Any number you type in the R environment is stored as a
real number (class numeric) by default. E.g.,
> 10

• For integer, you need to specify the L suffix. E.g.,


> 10L

• NaN is defined as Not A Number
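
As a quick illustration (a minimal sketch, not part of the original slides), NaN typically appears as the result of an undefined calculation such as 0/0. The variable names x and y are only for this example:

```r
# NaN comes from undefined arithmetic; NA marks a missing value.
x <- 0 / 0           # NaN
y <- c(1, NA, NaN)

is.nan(x)            # TRUE
is.nan(y)            # FALSE FALSE TRUE  (NA is not NaN)
is.na(y)             # FALSE TRUE TRUE   (NaN also counts as missing)
class(x)             # "numeric"
```

Note that is.na() treats NaN as missing, while is.nan() does not treat NA as NaN.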

35
Objects in R: Hands-on (Try This)

> "hello" -> x


> class(x)

> y <- 10
> class(y)

> z <- 10L


> class(z)

36
Creating Vectors I

• Scalars are the most basic vectors. To create vectors


of length greater than one, use the concatenation
function c():

d <- c(3, 4, 7); d


[1] 3 4 7

*The semicolon ; is used to combine multiple


statements on one line.

37
Creating Vectors II

• To create a null vector:


x <- c(); x
NULL

• Creating a vector with equal spacing, use the sequence


function seq():
e <- seq(from = 1, to = 3, by =
0.5); e
[1] 1.0 1.5 2.0 2.5 3.0

38
Creating Vectors III

• To create a vector of a given length, use the replicate
function rep():

f <- rep(NA , 6); f


[1] NA NA NA NA NA NA

39
Some Useful Vector Functions I

• To find the length of the vector, use length():


length (d)
[1] 3

• To find the maximum value of the vector, use the


maximum function max():
max(d)
[1] 7

40
Some Useful Vector Functions II

• To find the minimum value of the vector, use the


minimum function min():
min(d)
[1] 3

• To find the mean of the vector, use mean():


mean (d)
[1] 4.666667

41
Some Useful Vector Functions III

•To sort the vector, use sort():

g <-c(2, 6, 7, 4, 5, 2, 9, 3, 6, 4, 3)
sort (g, decreasing = TRUE )
[1] 9 7 6 6 5 4 4 3 3 2 2

42
Some Useful Vector Functions IV

• To find the unique elements of the vector, use unique():


unique (g)
[1] 2 6 7 4 5 9 3

• Alternatively, to find the elements of the vector that repeat,


use duplicated():
duplicated (g)
[1] FALSE FALSE FALSE FALSE FALSE
TRUE
[7] FALSE FALSE TRUE TRUE TRUE

43
Some Useful Vector Functions V

• To determine if a value is missing (NA), use is.na. This is useful


for finding missing values and removing them, or doing
something else with them.
a <- c(1 ,2 ,3 ,NA ,6)
is.na(a)
[1] FALSE FALSE FALSE TRUE FALSE

• But some functions do not tolerate missing values.

44
Some Useful Vector Functions VI

Caution!

mean(a)
[1] NA

mean(a, na.rm=TRUE)
[1] 3

45
Some Useful Vector Functions VII

• To get the number of missing values in a vector,


sum(is.na(a))
[1] 1

• There are other ways to handle missing values. See


?na.action.

46
Some Useful Vector Functions VIII

• One final common function you can use on vectors


(and other objects) is summary.
summary (a)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   1.00    1.75    2.50    3.00    3.75    6.00       1
• There are many, many other functions you can use on
vectors!

47
Comparison in R

48
1000 - 1040
READING AND WRITING
DATA
Data from the Internet I

• To read data from the internet,
use read.table(). In the arguments of the
function:

• header, if TRUE, tells R to include variable names
when importing
• sep tells R how the entries in the data set are
separated
• sep="," when entries are separated by COMMAS
• sep="\t" when entries are separated by TAB
• sep=" " when entries are separated by SPACE
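
The effect of sep can be tried without a network connection. The sketch below is only an illustration: the two small text snippets and the names csv_text, tab_text, from_csv and from_tab are made up, and read.table() reads them through its text argument instead of a URL:

```r
# Two made-up snippets holding the same data with different separators.
csv_text <- "name,score\nTan,90\nLee,85"
tab_text <- "name\tscore\nTan\t90\nLee\t85"

# header = TRUE takes the first line as variable names;
# sep must match the separator actually used in the data.
from_csv <- read.table(text = csv_text, header = TRUE, sep = ",")
from_tab <- read.table(text = tab_text, header = TRUE, sep = "\t")

names(from_csv)       # "name" "score"
from_csv$score        # 90 85
```

Both calls should give the same two-column data frame; using the wrong sep would instead read each line as a single column.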

50
Data from the Internet II

stock.data <- read.table("http://www.google.com/finance/historical?q=NASDAQ:AAPL&output=csv", header = TRUE, sep = ",")

51
Importing Data from Your Computer I

• Check what folder R is working with now:


getwd()
• Tell R in what folder the data set is stored (if different from
(1)). Suppose your data set is on your desktop:
setwd ("~/Desktop")
• Now use the read.table() command to read in the
data, substituting the name of the file for the website.

52
Using Data Available in R I

• To use a dataset available in one of the R packages, install


that package (if needed). Load the package into R, using the
library() function.
> library (datasets)

• Extract the dataset you want from that package, using the
data()function. In our case, the dataset is called airquality
> data(airquality)

53
Working with Datasets in R II

• After the variable names have been "attached", to


see the variable names, use names():
> names (airquality)

• To see the descriptions of the variables, use ?:


> ?airquality

54
Working with Datasets in R III

• To get an overview of the dataset and its variables,
use the summary() function:

# Check the summary of the dataset
summary(airquality)

55
Working with Datasets in R IV
• To get the mean of the Ozone variable in the dataset, use
mean():

mean(airquality$Ozone, na.rm = TRUE)
[1] 42.12931

56
Morning Break
Please be back by 11:00 am
1100 - 1200

LIST, MATRIX, DATA FRAME


List
• Lists are a special type of vector that can contain elements of
different classes.
• Lists can be explicitly created using the list() function, which
takes an arbitrary number of arguments.
x <- list(1, "a", TRUE, 1 + 4i)
> x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i
59
Creating Matrices I

• To create a matrix, use the matrix() function:


mat <- matrix(10:15, nrow = 3, ncol = 2); mat
     [,1] [,2]
[1,]   10   13
[2,]   11   14
[3,]   12   15

60
Some Useful Matrix Functions I

• To add two matrices, use +


mat+mat
[ ,1] [ ,2]
[1 ,] 20 26
[2 ,] 22 28
[3 ,] 24 30

61
Some Useful Matrix Functions II

• To find the transpose of a matrix, use t():


t(mat)
[ ,1] [ ,2] [ ,3]
[1 ,] 10 11 12
[2 ,] 13 14 15

62
Some Useful Matrix Functions III

• To find the dimensions of a matrix, use dim():


dim(mat)
[1] 3 2
• Alternatively, we can find the rows and columns of
the matrix, by nrow()and ncol().

63
Some Useful Matrix Functions IV

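• To multiply two matrices, use %*%. The number of columns of the
first matrix must equal the number of rows of the second, so the
3 x 2 matrix mat is multiplied by its transpose here.
• Note: If you use * instead, you will be performing
element-wise multiplication.
mat %*% t(mat)
     [,1] [,2] [,3]
[1,]  269  292  315
[2,]  292  317  342
[3,]  315  342  369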

64
Creating Matrices from Vectors I

• To stack two vectors, one below the other, use rbind():


mat1 <- rbind (d,d); mat1
[ ,1] [ ,2] [ ,3]
d 3 4 7
d 3 4 7

65
Creating Matrices from Vectors II

• To stack two vectors, one next to the other, use cbind():


mat2 <- cbind (d,d); mat2
d d
[1 ,] 3 3
[2 ,] 4 4
[3 ,] 7 7

66
Names for matrices
• Matrices can have both column and row names.

> m <- matrix(1:4, nrow = 2, ncol = 2)


> dimnames(m) <- list(c("a", "b"), c("c", "d"))
> m
c d
a 1 3
b 2 4
• Column names and row names can be set separately using the
colnames() and rownames() functions.
> colnames(m) <- c("h", "f")
> rownames(m) <- c("x", "z")
> m
h f
x 1 3
z 2 4

67
Data Frame

• Data frames are used to store tabular data in R. They are an


important type of object in R and are used in a variety of
statistical modelling applications.
• Data frames are represented as a special type of list where every
element of the list has to have the same length.
• Each element of the list can be thought of as a column and the
length of each element of the list is the number of rows.
• Unlike matrices, data frames can store different classes of objects
in each column. Matrices must have every element be the same
class (e.g. all integers or all numeric).

> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
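
The points above can be checked directly. This is a minimal sketch that rebuilds the same data frame x and adds a small made-up matrix m for contrast:

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

is.list(x)           # TRUE: a data frame is a special kind of list
length(x)            # 2: one list element per column
nrow(x)              # 4: the common length of the elements
sapply(x, class)     # foo is "integer", bar is "logical"

# A matrix holds a single class, so mixing numbers and text
# turns everything into character.
m <- cbind(1:2, c("a", "b"))
class(m[1, 1])       # "character"
```

Each column keeps its own class in the data frame, while the matrix coerces all entries to one class.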

68
Names
• R objects can have names, which is very useful for writing
readable code and self-describing objects. Here is an example
of assigning names to an integer vector.

> x <- 1:3
> names(x)
NULL

> names(x) <- c("New York", "Seattle", "Los Angeles")
> x
   New York     Seattle Los Angeles
          1           2           3

> names(x)
[1] "New York"    "Seattle"     "Los Angeles"
69
Names
• Lists can also have names, which is often very useful.
> x <- list("Los Angeles" = 1, Boston = 2, London =
3)
> x
$`Los Angeles`
[1] 1
$Boston
[1] 2
$London
[1] 3

> names(x)
[1] "Los Angeles" "Boston" "London"
70
1200 - 1300
SUB-SETTING DATA
Subsetting with Vectors I

• To find out what is stored in a given element of the


vector, use [ ]:
d[2]
[1] 4

• To see if the elements of a vector equal a certain


number, use ==:
d==3
[1] TRUE FALSE FALSE

72
Subsetting with Vectors II

• To see if any of the elements of a vector do not


equal a certain number, use !=:
d != 3
[1] FALSE TRUE TRUE

73
Subsetting with Vectors III

• To obtain the element number of the vector when a condition


is satisfied, use which():
which (d==4)
[1] 2

• To store the result, type: a=which(d==4); a

74
Subsetting with Vectors IV

• We can also tell R what we do not want when


subsetting by using the minus - sign. To obtain
everything but the 2nd element,
d <- seq (1 ,10 ,2)
d[ -2]
[1] 1 5 7 9

75
Subsetting with Vectors V
• We can use subsetting to explicitly tell R what observations we
want to use. To get all elements of d greater than or equal to 2,
d[d >= 2]
[1] 3 5 7 9

• R will return values of d where the expression within brackets is


TRUE. Think of these statements as: “give me all d such that d ≥
2.”

76
Day 1 Exercise 1

• Create a vector of the positive odd integers less than 100


• Remove the values greater than 60
• Remove the values less than 80
• Find and show values in between 60 and 80

77
Sub-setting with Matrices I

• To see what is stored in the first element of the


matrix, use [ ]:
mat [1 ,1]
[1] 10
• To see what is stored in the first row of the matrix:
mat [1 ,]
[1] 10 13

78
Sub-setting with Matrices II

• To see what is stored in the second column of the


matrix:
mat[, 2]
[1] 13 14 15

79
Sub-setting with Matrices III

• To extract elements 1 and 3 from the second column,


use c() and [ ]:
mat[c(1 ,3) , 2]
[1] 13 15

80
Exercise 2

• Find the matrix multiplication of A and B if

81
Lunch
Please be back by 2:00 pm
1400 - 1500

CONTROL STRUCTURES
Control Structures

• Control structures in R allow you to control the flow of


execution of a series of R expressions.
• Commonly used control structures are
• if and else: testing a condition and acting on it
• for: execute a loop a fixed number of times
• while: execute a loop while a condition is true
• repeat: execute an infinite loop (must break out of it to
stop)
• break: break the execution of a loop
• next: skip an iteration of a loop
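
As a small sketch of how repeat, break and next work together (the variables total and i are made up for this example), the loop below adds up the odd numbers from 1 to 10:

```r
total <- 0
i <- 0
repeat {
  i <- i + 1
  if (i > 10) break        # leave the loop entirely
  if (i %% 2 == 0) next    # skip even numbers, go to the next iteration
  total <- total + i
}
total                      # 1 + 3 + 5 + 7 + 9 = 25
```

Without the break line, the repeat loop would never stop.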

84
Control Structures

• If-condition
if (<condition>) {
## do something
}
## Continue with rest of code

• If-else condition (else must be on the same line as the closing brace)
if(<condition>) {
  ## do something
} else {
  ## do something else
}

85
For loops (Basic)

for (i in 1:10) {
  print(i)
  cat("Hello World\n")
}

86
For Loop

A few versions of for loop

x <- c("a", "b", "c", "d")

for(i in 1:4){
  print(x[i])
}

for(letter in x){
  print(letter)
}

for(i in seq_along(x)){
  print(x[i])
}

for(i in 1:length(x)){
  print(x[i])
}
For Loop: Hands-on

• Use a for loop to list all the files in a folder (you
can choose any folder).

Steps:
1. Read all the files in the folder – list.files()
2. Use for loop to iterate through the files. Print one
file at a time.

88
For Loop: Hands-on

> file_list <- list.files()

> for (file in file_list){


print(file)
}

89
While loops

count <- 0
while(count < 10) {
print(count)
count <- count + 1
}

90
Exercise 3
• Just a short exercise: use loops to compute a
multiplication table and store it in a data frame or
vector, e.g. store only the answers.

[1] 1
[1] 2
[1] 3
[1] 2
[1] 4
[1] 6
[1] 3
[1] 6
[1] 9

91
Loop example (Harder)
• If we call the duplicated() function, e.g.

x <- c(3, 5, 7, 2, 9, 4, 3, 2, 8, 5, 2)
tf <- duplicated(x)
• If we want to extract the values that have duplicates,

for(i in seq_along(tf)) {
  if (tf[i]) {
    print(x[i])
  }
}

92
Date and Time

• Dates are represented by the Date class.
• Times are represented by the POSIXct and POSIXlt classes.
• Dates are stored internally as the number of days since 1970-01-01;
times (POSIXct) are stored as the number of seconds since 1970-01-01.
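
The internal representation can be inspected with as.numeric(). This is a minimal sketch; the example dates and the names d and tm are made up, and the time zone is fixed to UTC so the result does not depend on the local setting:

```r
d <- as.Date("1970-01-10")
as.numeric(d)     # 9: days since 1970-01-01

tm <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(tm)    # 60: seconds since 1970-01-01 00:00:00 UTC
```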

93
Date and Time: Hands-on
> today <- "3/6/2015"
> class(today)
> today.date<-as.Date(today,"%d/%m/%Y")
> today.date
> class(today.date)
> unclass(today.date)

Try This!

94
Date and Time: Hands-on

If time is involved, then as.POSIXct() should be used

> date1 <- as.POSIXct("2015-05-28 01:00:00")
> date2 <- as.POSIXct("2015-06-28 01:00:00")
> date1 - date2

POSIXct stores the number of seconds since 1st January 1970

95
1500 - 1540
CREATING AND USING
FUNCTIONS
Functions
• Writing functions is a core activity of an R programmer. It
represents the key step of the transition from a mere “user” to a
developer who creates new functionality for R.
• Functions are defined using the function() directive and are
stored as R objects just like anything else. In particular, they are
R objects of class “function”.

f <- function() {
## This is an empty function
}

97
Functions

f <- function(num) {
for(i in seq_len(num)) {
cat("Hello, world!\n")
}
}
• Now run f(3)

• You can try to write a function called cube(), which takes a
value (num) and returns the cube of the value (num^3)

98
The paste function
• The paste function pastes two or more objects together.
• E.g.,
> paste("a", "b", sep=" ")
[1] "a b"
> paste("a", "b", sep="***")
[1] "a***b"
> paste("a", "b", "c", sep="***")
[1] "a***b***c"

99
Argument Matching

• R function arguments can be matched positionally or
by name.

• Positional matching just means that R assigns the first
value to the first argument, the second value to the
second argument, etc.
• In other words, the position of a value in the call determines
which parameter receives it.
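
Both kinds of matching can be seen with a small function. The function power() below is hypothetical, written only for this illustration:

```r
# A hypothetical helper: exponent has a default value of 2.
power <- function(base, exponent = 2) {
  base^exponent
}

power(3)                        # 9: exponent falls back to its default
power(2, 3)                     # 8: matched by position
power(exponent = 3, base = 2)   # 8: matched by name, order does not matter
power(3, base = 2)              # 8: base is matched by name first,
                                #    so 3 goes to exponent
```

Named arguments are matched first, and the remaining values are then assigned by position.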

100
1600 - 1700
LABORATORY EXERCISE
BASIC DATA SCIENCE
WITH R

Day 2

Universiti Sains Malaysia October 2016


Training Schedule: Day 2
DAY2
0900-1000 Scoping rules, manipulating dates and times
1000-1040 Using the R “apply” functions
1040-1100 Morning Break
1100-1200 Reshaping data, sub-setting observations & variables,
summarizing data.
1200-1300 Laboratory exercise – Getting, format and store data.
1300-1400 Lunch
1400-1540 Getting data from different sources
1540-1600 Afternoon Break
1600-1700 Laboratory exercise
0900 – 1000
SCOPING RULES, MANIPULATING
DATES AND TIMES
Lexical Scoping
• Also known as static scoping (as opposed to dynamic
scoping)

• Sets the scope of a variable so that it may only be
referenced from within the block of code in which it is
declared.

• This is used in R
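
One way to see lexical scoping at work is a function that creates another function. This is a minimal sketch; make_multiplier() and triple() are made-up names for the illustration:

```r
make_multiplier <- function(n) {
  # The inner function looks up n in the environment where it
  # was defined, not where it is later called from.
  function(x) x * n
}

n <- 100                        # a global n, ignored by triple()
triple <- make_multiplier(3)
triple(5)                       # 15, not 500
```

Under dynamic scoping the call would pick up the global n and return 500 instead.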

105
Dynamic Scoping

#!/bin/bash
x=10
function f {
  x=$(($1 * $1))
  echo $x
}
f $x      # What is printed? 100
echo $x   # What is printed? 100
Lexical Scoping (R)

## Example of Lexical Scoping


x <- 10
f <- function(x) {
x<- x^2
print(x)
}
f(x)
print(x)
What is printed out, try it in your RStudio.

107
Date and Time

• Dates are represented by the Date class.
• Times are represented by the POSIXct and POSIXlt classes.
• Dates are stored internally as the number of days since 1970-01-01;
times (POSIXct) are stored as the number of seconds since 1970-01-01.

108
Date and Time: Hands-on
> today <- "3/6/2015"
> class(today)
> today.date<-as.Date(today,"%d/%m/%Y")
> today.date
> class(today.date)
> unclass(today.date)

Try This!

109
Date and Time: Hands-on

If time is involved, then as.POSIXct() should be used

> date1 <- as.POSIXct("2015-05-28 01:00:00")
> date2 <- as.POSIXct("2015-06-28 01:00:00")
> date1 - date2

POSIXct stores the number of seconds since 1st January 1970

110
Date: the lubridate package

• R has a dedicated date package named lubridate that makes
dates easier to manipulate.
> library(lubridate)
> ymd("20150715")
> mdy("07/15/2015")
> dmy("15/07/2015")
> z <- ymd_hms("2015-07-15 18:47:00")
> wday(z)
> wday(z, label=TRUE)

111
Date
• To retrieve the date of today,
> d1 <- Sys.Date()

• To extract the information of the date,
> format(d1, "%a %b %d")

Format codes:
%a – weekday
%d – day number
%b – month
%y – year (two digits; %Y gives four digits)

• Changing characters to date
> x <- "15July2015"
> z <- as.Date(x, "%d%b%Y")

112
Date

• Extracting parts of a date, try the following

> weekdays(z)
> months(z)
> julian(z)   # tells you how many days since the "origin"

113
Exercise 1
• Use the airquality dataset
• Notice that there are two variables, Month and Day
• Use paste() to combine them and store the result in a new
vector
• Convert it to a date format (hint: use as.Date())

114
1000 – 1040
USING THE R “APPLY”
FUNCTIONS
Looping Functions

• R has some functions which implement looping in a


compact form to make your life easier.
• lapply(): Loop over a list and evaluate a function on each
element
• sapply(): Same as lapply but try to simplify the result
• apply(): Apply a function over the margins of an array
• tapply(): Apply a function over subsets of a vector
• mapply(): Multivariate version of lapply

116
lapply()

• The lapply() function does the following simple series


of operations:
1. it loops over a list, iterating over each element in that
list
2. it applies a function to each element of the list (a
function that you specify)
3. and returns a list (the l is for “list”).

It’s important to remember that lapply() always returns a list,


regardless of the class of the input.

117
lapply()

> x <- list(a = 1:5, b = rnorm(10))


> lapply(x, mean)

## Output
$a
[1] 3
$b
[1] 0.1322028

118
lapply()

> x <- list(a = matrix(1:4, 2, 2), b =


matrix(1:6, 3, 2))
> lapply(x, function(elt) { elt[,1] })

## What is the output of this lapply()

119
sapply()

• The sapply() function behaves similarly to lapply();


the only real difference is in the return value.
sapply() will try to simplify the result of lapply() if
possible. Essentially, sapply() calls lapply() on its
input and then applies the following algorithm:

• If the result is a list where every element is length 1, then


a vector is returned
• If the result is a list where every element is a vector of
the same length (> 1), a matrix is returned.
• If it can’t figure things out, a list is returned
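
The three cases can be reproduced with small made-up lists. This is only a sketch of the behaviour described above; the result names as_vector, as_matrix and as_list are hypothetical:

```r
# Every result has length 1: a (named) vector is returned.
as_vector <- sapply(list(a = 1:3, b = 4:6), mean)

# Every result has the same length (2): a matrix is returned,
# with one column per list element.
as_matrix <- sapply(list(a = 1:3, b = 4:6), range)

# Results have different lengths: sapply() gives up and returns a list.
as_list <- sapply(list(a = 1:3, b = 4:7), function(v) v[v > 2])

as_vector        # a = 2, b = 5
dim(as_matrix)   # 2 2
class(as_list)   # "list"
```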
120
sapply()

## Output with lapply()


> x <- list(a = 1:4, b = rnorm(10), c
= rnorm(20, 1), d = rnorm(100, 5))
> lapply(x, mean)

## Output with sapply()


> sapply(x, mean)

121
apply()

• The apply() function is used to evaluate a function
(often an anonymous one) over the margins of an
array.

• It is most often used to apply a function to the rows


or columns of a matrix (which is just a 2-dimensional
array).

122
apply()

> x <- matrix(rnorm(200), 20, 10)


> apply(x, 2, mean) ## Take the mean
of each column
> apply(x, 1, sum) ## Take the sum of
each row

123
apply()

• Some useful shortcuts

rowSums(x) is equivalent to apply(x, 1, sum)
rowMeans(x) is equivalent to apply(x, 1, mean)
colSums(x) is equivalent to apply(x, 2, sum)
colMeans(x) is equivalent to apply(x, 2, mean)

124
tapply()
• tapply lets you iterate over a data type called factor.
Therefore, good for grouping purpose.
• E.g,
> name<-c("Tan","Tan","Tan","Lee","Lee","Lee")
> subject<-c("IT","CS","AI","IT","CS","AI")
> marks<-c(90,95,80,90,99,85)
> df<-data.frame(name,subject,marks)

125
tapply()

• To get the sum of marks based on name, you type


> tapply(df$marks,df$name,sum)

How about looking for mean for each subject?

126
Short Exercise

• Using the “apply” function on the airquality dataset


to compute the:

• Total amount of radiation (sum) by month


• The average (mean) daily Ozone level.

127
Looping …

• Load the dataset named “iris”.


> library(datasets)
> data(iris)
> ?iris

128
Try some of the following commands

> rowMeans(iris[,1:4])
> colMeans(iris[,1:4])
> apply(iris, 1, mean)
> apply(iris[, 1:4], 2, mean)

These are the same?
Now, are they the same?

129
• Load the dataset named “mtcars”.
> library(datasets)
> data(mtcars)
> ?mtcars

130
Try some of the following commands

> tapply(mtcars$mpg, mtcars$cyl, mean)
> with(mtcars, tapply(mpg, cyl, mean))

Basically, the function with() makes it easier to type.

> split(mtcars, mtcars$cyl)
> lapply(mtcars, mean)
131
Morning Break
Please be back by 11:00 am
1100 – 1200
RESHAPING DATA, SUB-SETTING
OBSERVATIONS & VARIABLES,
SUMMARIZING DATA.
Sub-setting Data

Create some data

> set.seed(1)
> x <- data.frame("var1"=sample(1:5),

"var2"=sample(6:10),"var3"=sample(11:15))
> x$var2[c(1,3)]=NA
> x

134
Subsetting Data: Hands-on

> x[,1]
> x[1:2, "var2"]
> x[x$var1<=3 & x$var3 >10,]
> x[x$var1>2 | x$var3 >10,]

Task: extract all rows with “NA”

135
Subsetting Data: Hands-on
> x[x$var2>1,]

Data for var1 and var3 are not presented correctly for the rows with NA.

136
Subsetting Data: Hands-on

> x[which(x$var2>1),]

137
Subsetting Data: Hands-on

• To retrieve row having maximum value for “var1”


> x[which.max(x$var1),]

> Task: look for row with minimum value


for var3

138
Sorting Data

• To sort values for a particular column


> sort(x$var1)
• to sort decreasingly
> sort(x$var1,decreasing=TRUE)
• What if I have “NA”
> sort(x$var2,na.last=TRUE)

139
Sorting Data

How to sort the entire dataset based


on a particular column, say “var1”?

140
Sorting Data: Hands-on

> x[order(x$var1),]

To sort data based on multiple columns,


> x[order(x$var1, x$var3),]

141
Sorting Data: plyr package

> library(plyr)
> arrange(x, var1)
> arrange(x, desc(var1))

142
Quantile

> x <- rnorm(100, 50,20)


> hist(x)
> quantile(x)

143
Merging data

Let’s create two data frames:


> set.seed(1)
> x<- sample(1:20,10)
> y<- sample(30:50,10)
> dt.1 <- data.frame(x,y)

> set.seed(2)
> x<- sample(1:20,10)
> y<- sample(30:50,10)
> dt.2 <- data.frame(x,y)
144
Merging data

• To perform an inner join by column “x”


> merge(dt.1, dt.2, by="x")

• To perform a full outer join by column “x”, keeping rows
from either data frame even when only one side matches,
> merge(dt.1, dt.2, by="x", all = TRUE)

• To perform an inner join by column “x” and “y”


> merge(dt.1, dt.2, by=c("x","y"))
145
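
The difference between the inner and the full outer join is easier to see on tiny data. In this sketch the data frames left and right, and their contents, are made up for the illustration:

```r
left  <- data.frame(id = c(1, 2, 3), a = c("x", "y", "z"))
right <- data.frame(id = c(2, 3, 4), b = c(TRUE, FALSE, TRUE))

# Inner join: only ids present in both data frames (2 and 3).
inner <- merge(left, right, by = "id")

# Full outer join: all ids from either side (1 to 4);
# columns without a match are filled with NA.
full <- merge(left, right, by = "id", all = TRUE)

nrow(inner)   # 2
nrow(full)    # 4
full
```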
Merging Data: Hands-on

> df1<-data.frame(CustomerId = c(1:6),


Product = c(rep("Honda", 3),
rep("Chevrolet", 3)))

> df2<-data.frame(CustomerId = c(2, 4, 7),


State =
c("Selangor","Sarawak","Kelantan"))

What do the data frames look like?

146
Merging Data: Hands-on

Task 1:
Return only the rows in which the df1 have
matching keys in df2.

147
Merging Data: Hands-on

Task 2:
Return all rows from the df1, and any rows with
matching keys from df2.

148
Merging Data: Hands-on

Task 3:
Return all rows from the df2, and any rows with
matching keys from df1.

149
LABORATORY EXERCISE –
GETTING, FORMATTING AND
STORING DATA
Downloading a File

• Go to http://www.data.gov.my/data/dataset/b5cd948f-cffb-4439-ae08-e508ff073a93/resource/1f2d5629-ac8d-449a-a4c1-9d269d625d84/download/lokalitihotspot2015.xlsx
and download the dengue hotspot data for 2015

151
Downloading a File

> url <- "http://www.data.gov.my/data/dataset/b5cd948f-cffb-4439-ae08-e508ff073a93/resource/1f2d5629-ac8d-449a-a4c1-9d269d625d84/download/lokalitihotspot2015.xlsx"
> download.file(url, "dengue.xlsx", mode = "wb")

152
Downloading a File

• Convert the file to CSV for ease of reading
> dt <- read.csv("dengue.csv")

• Check the field names of the dataset


> names(dt)

153
Column Name Manipulation
• Remove dot(.) in all the field names
> names(dt) <- gsub(".", "", names(dt), fixed=TRUE)

fixed=TRUE makes sure gsub treats "." as a literal dot, not as a
regular-expression wildcard.
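
The effect of fixed=TRUE is easy to check on a single name. This is a minimal sketch using one of the field names of the dengue file; the variable nm is made up for the example:

```r
nm <- "Jumlah.Kes.Terkumpul"

gsub(".", "", nm)                  # "": as a regex, "." matches every character
gsub(".", "", nm, fixed = TRUE)    # "JumlahKesTerkumpul"
gsub("\\.", "", nm)                # same result, escaping the dot instead
```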

154
Column Name Manipulation

• To rename all the columns,
> names(dt) <- c("Year","Week","State","District","Location","Total","Outbreak Duration")

• To rename a specific column only,
> names(dt)[2] <- "Week No"

Check the changes to the columns

155
Column Name Manipulation

• Replace all spaces in the field names with "_".
> names(dt) <- gsub(" ", "_", names(dt))

156
Records Manipulation

• Inspect the data for location; you can see there are many
inconsistent spellings. Say we want to replace “Tmn” with “Taman”
for all data in Location:
> dt$Location <- gsub("Tmn", "Taman", dt$Location)

Check the data first before replacing anything.
Try replacing all “Kg” with “Kampung”.

157
String split

• The command is strsplit()
> strsplit("hello Malaysia !", " ")

• To split a particular record based on a particular character in
a dataset,
> strsplit(as.character(dt$Location), " ")[[1]]

158
Finding Values

• Say if we want to know how many times the word “Taman”


has appeared in the “Location” field,

> grep("Taman",dt$Location)
> grep("Taman",dt$Location, value=TRUE)

How many of them? What command to use?
Finding Values

• How many with “Taman” and otherwise?

> cnt<-table(grepl("Taman",dt$Location))
> barplot(cnt)

What do
you get?
Explain…

160
Finding Values

• Now, store all the data with the word “Taman” in Location in a
variable named dt_Taman.

> dt_Taman <- dt[grepl("Taman", dt$Location),]

How many records in dt_Taman ?

161
Finding Values

• Now, store all the data with the word “Seksyen” or “Medan”
in Location

> dt[grepl("Seksyen|Medan", dt$Location),]

162
Finding Values

• Looking for any characters.


• For example, we could retrieve locations that begins with the
character “T” and contains character “1” from the “Location”
column.

> dt[grep("^T(.*)1", dt$Location),"Location"]

163
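
The patterns from the last two slides can be tried on a few made-up location strings before running them on the real data. The vector loc below is hypothetical and only for illustration:

```r
# Made-up location names, not taken from the dengue dataset.
loc <- c("Taman Sri Muda 1", "Taman Melati", "Seksyen 17", "Tmn 1")

# "^T" anchors the match at the start; "(.*)1" allows any characters
# before a "1" somewhere later in the string.
grepl("^T(.*)1", loc)                     # TRUE FALSE FALSE TRUE

# "|" means "or": strings containing Seksyen or Medan.
grep("Seksyen|Medan", loc, value = TRUE)  # "Seksyen 17"
```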
Exercise 2 (re-read the dengue csv file)
• What is the command to split data according to
“Negeri”?
• What is the command to split “Jumlah.Kes.Terkumpul”
according to “Negeri”?
• Apply a function across elements of the list
• How to get the sum of Jumlah.Kes.Terkumpul for each
Daerah.Zon.PBT?

164
Homework

• Look for all locations with the words “Jalan” and


“Jaya”.
• Retrieve all records with the word “Taman” in
“Location” in the district “Petaling”.
• How many “Kampung” in Selangor and Perak. Plot a
graph to compare the two.

[estimation = 20 minutes]

165
Lunch Break
Please be back by 2:00 pm
1400 – 1540
GETTING DATA FROM
DIFFERENT SOURCES
Downloading files

• Always check and set your working directory using the getwd() and setwd() commands
• To check whether the “data” directory exists, and create it if it does not:

if (!file.exists("data")){
dir.create("data")
}

168
Read MySQL

• MySQL is a free and widely used open-source database system
• Install MySQL from here
• Use library(RMySQL)

169
UCSC MySQL

170
Demo (connect and read tables)

• Connect to the UCSC Genome online MySQL databases
• List all the databases
• Connect to the “hg18” database
• List all the tables in the “hg18” database
• Show tables 1 to 5
• List all the fields for the “HInvGeneMrna” table
• Select all from the “HInvGeneMrna” table where “qNumInsert is between 2 and 3”
• Show the result

171
Basic RMySQL commands

• ucscDb <- dbConnect(MySQL(), user="genome",
  host="genome-mysql.cse.ucsc.edu") #connect to a MySQL server with a user name and host
• dbGetQuery #send an SQL query and fetch the result
• dbListTables #list all tables
• dbListFields #list all fields for a table
• dbReadTable #read the content of a table
• dbSendQuery #send an SQL query without fetching the result
• dbDisconnect #close the connection

172
Read from the Web

• Web scraping: programmatically extracting data from the HTML code of websites
• Use readLines() or parse with the XML package
• Use GET from the httr package
• For password-protected websites, you need to provide a username and password
• Use handles

173
Example Webpage

174
Read from the Web
> con <- url("http://pesona.mmu.edu.my/~ccho")
> htmlCode <- readLines(con)
> close(con)
> htmlCode
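The same open, read, close pattern can be practised offline. The following is a minimal sketch that assumes a few made-up HTML lines (`html_text`) in place of a live page and uses textConnection() instead of url().

```r
# Hypothetical HTML lines, used in place of a live web page
html_text <- c("<html>",
               "<head><title>Demo</title></head>",
               "<body>Hello</body>",
               "</html>")

# textConnection() lets readLines() read a character vector as if it were a file
con <- textConnection(html_text)
htmlCode <- readLines(con)
close(con)

# readLines() returns one string per line; nchar() counts characters per line
n_lines <- length(htmlCode)
chars_2_and_3 <- nchar(htmlCode[c(2, 3)])
print(chars_2_and_3)
```

With a real page, only the connection changes: replace textConnection(html_text) with url() and the address of the page.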

175
Exercise 3

• Read the HTML content from this URL: http://pesona.mmu.edu.my/~ccho
• Using the readLines function, find out how many characters are in the 20th and 40th lines of HTML from this page
• The nchar() function in R may be helpful

176
Further Resources – Read from the Web

• R-bloggers has a number of examples of web scraping: http://www.r-bloggers.com/search/web%20scraping

• The httr help file has useful examples: https://cran.fhcrc.org/web/packages/httr/httr.pdf

177
Downloading files

• download.file() downloads a file from the Internet
• Important parameters are url, destfile, method
• For example, download Alphabet share prices from Google Finance

fileURL <- "http://www.google.com/finance/historical?q=NASDAQ%3AGOOGL&ei=hpYFWJuSIZGHuwSKg43wDA&output=csv"
download.file(fileURL, destfile="alphabet.csv")
list.files()
178
Exercise 4

• Download the data for “Direktori Polis DiRaja Malaysia Sehingga September 2016” from this URL:
http://www.data.gov.my/data/ms_MY/dataset/7864e7e5-92f5-4b3e-9500-d4cf3f75696f/resource/3594f55b-52cd-4fe9-9aae-2d6d2fb8c6bd/download/direktori-pdrm.csv
• Store it in a file named pdrm.csv

179
Read local files (Revision)

• The main function is read.table
• read.table is flexible and robust, but requires more parameters
> ?read.table # to find out more about read.table()
• read.table reads all the data into RAM, so it is not advisable for large data sets
• Important parameters are: file, header, sep, row.names, nrows
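To see how header, sep and nrows interact, here is a minimal sketch. It assumes a small, made-up table supplied inline through the text argument of read.table, so no file or download is needed.

```r
# Hypothetical semicolon-delimited data, supplied inline via the text argument
raw <- "id;name;score
1;Ali;3.5
2;Mei Ling;4.0
3;Kumar;2.5"

# header = TRUE uses the first line as column names,
# sep = ";" sets the field separator,
# nrows = 2 stops after two data rows
dat <- read.table(text = raw, header = TRUE, sep = ";",
                  nrows = 2, stringsAsFactors = FALSE)
print(dat)
str(dat)
```

Limiting nrows is one simple way to preview a large file before reading all of it into memory.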

180
Exercise

• Read a file with “,” as the delimiter

> comma_dat <- read.table("http://bit.ly/1e9Cvzu", header=TRUE, sep=",")

• Read the file test_semicolon.txt, which uses “;” as the delimiter, from http://bit.ly/1RGbPn9 and store it in semi_dat

• Show the content of semi_dat

181
read.csv() and read.csv2()
• read.csv and read.csv2 are identical to read.table except for the defaults.
• They are intended for reading ‘comma separated value’ files (‘.csv’) or (read.csv2) the variant used in countries that use a comma as decimal point and a semicolon as field separator.

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)
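The difference between the two sets of defaults is easiest to see side by side. This sketch assumes two made-up inline tables holding the same prices, one written with a decimal point and one with a decimal comma.

```r
# The same two values written in the two conventions (made-up numbers)
us_style <- "item,price\nA,1.5\nB,2.25"
eu_style <- "item;price\nA;1,5\nB;2,25"

# read.csv expects "," as separator and "." as decimal point
us <- read.csv(text = us_style)

# read.csv2 expects ";" as separator and "," as decimal point
eu <- read.csv2(text = eu_style)

# Both calls should yield the same numeric column
print(us$price)
print(eu$price)
```

Reading the second table with read.csv instead would not produce a numeric price column, which is a common source of confusion with regional CSV exports.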

182
Exercise 5

• Download the “Genting Malaysia Berhad” share prices (in csv) from Yahoo! Finance (URL: http://finance.yahoo.com/q/hp?s=GMALF+Historical+Prices)
• Store them in a file named genting.csv
• Read the file and describe the data obtained:
• Number of observations
• Number of variables
• Names of the variables
• Calculate the average closing price
• Calculate the number of days where the closing price is more than RM 1.35

183
Reading Excel Files
• Use the xlsx package: library(xlsx)

read.xlsx(file, sheetIndex, sheetName=NULL, rowIndex=NULL,
          startRow=NULL, endRow=NULL, colIndex=NULL,
          as.data.frame=TRUE, header=TRUE, colClasses=NA,
          keepFormulas=FALSE, encoding="unknown", ...)

rowIndex - a numeric vector indicating the rows you want to extract. If NULL, all rows found will be extracted, unless startRow or endRow are specified.
colIndex - a numeric vector indicating the columns you want to extract. If NULL, all columns found will be extracted.
• read.xlsx2 is much faster but can be unstable
• In general, it is advisable to store data in either .csv or .txt format, which is easier to distribute
184
Exercise 6
• Download the Excel spreadsheet on the Natural Gas Acquisition Program here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx

• Read rows 74 to 79 and columns 14 to 21 into R and assign the result to a variable called: xlsdat
• Calculate the sum of Supp_Vol * Supp_Org_Cost

185
Reading from XML

• Extensible Markup Language
• Frequently used to store structured data
• Particularly widely used in Internet applications
• Extracting XML is the basis for most web scraping
• Components
• Markup - labels that give the text structure
• Content - the actual text of the document

• http://en.wikipedia.org/wiki/XML

186
Tags, elements and attributes

• Tags correspond to general labels
• Start tags <section>
• End tags </section>
• Empty tags <line-break />
• Elements are specific examples of tags
<Greeting> Hello, World </Greeting>
• Attributes are components of the label
<img src="khpoo.jpg" alt="instructor" />

187
Example XML file

188
Read the XML file into R

> library(XML)
> fileURL <- "http://www.w3schools.com/xml/simplexsl.xml"
> doc <- xmlTreeParse(fileURL, useInternalNodes=TRUE)
> rootNode <- xmlRoot(doc)
> xmlName(rootNode)
> names(rootNode)

189
Directly access parts of the XML
document
> rootNode[[1]]

> rootNode[[1]][[1]]

190
Programmatically extract parts of the file

> xmlSApply(rootNode,xmlValue)

191
XPath

• /node                    Top level node
• //node                   Node at any level
• node[@attr-name]         Node with an attribute name
• node[@attr-name='bob']   Node with attribute attr-name='bob'

192
Get the items on the menu and prices

> xpathApply(rootNode, "//name", xmlValue)

> xpathApply(rootNode, "//price", xmlValue)

193
Reading JSON

• JavaScript Object Notation
• Lightweight data storage
• Use the jsonlite package: library(jsonlite)
• Structure similar to XML but with a different syntax/format
• Read JSON using the fromJSON function; to convert data into JSON, use the toJSON function

194
Example of JSON file

195
Afternoon Break
Please be back by 4:00 pm
1600 - 1700

LABORATORY
EXERCISE
BASIC DATA SCIENCE
WITH

Day 3

Universiti Sains Malaysia October 2016


Training Schedule: Day 3
DAY3
0900-1000 Introduction to base plotting system
1000-1040 Laboratory exercise – base plot
1040-1100 Morning Break
1100-1200 Introduction to Lattice Plotting System – to be done
1200-1300 Laboratory exercise – Lattice Plot – to be done
1300-1400 Lunch
1400-1500 Introduction to ggplot2 plotting system
1500-1540 Laboratory exercise – ggplot2 plot
1540-1600 Afternoon Break
1600-1700 Laboratory exercise – Experiments with colours in plot
0900 – 1040
INTRODUCTION TO BASE PLOT
Plotting Systems

•R has 3 main systems for the plotting of graphs


• {base}
• {lattice}
• {ggplot2}

•Graphics plotting devices in R

201
Principle of Analytic Graphics

• The basis of Analytic Graphics is NOT simply to plot univariate or multivariate graphs, or to make beautiful graphics with the fancy tools that you have.

• It is to describe and document your data so that appropriate analytics can be applied to it.

202
Plotting systems in R

•There are 3 common plotting systems available in R.
• {base}, which is part of the default R installation.
• {lattice}, which is loaded using the library(lattice) command.
• {ggplot2}, which is loaded using the library(ggplot2) command.

203
{base} graphics system
• Common main plotting functions.
• hist(), barplot(), boxplot(), plot()

• They are usually annotated with separate functions. The following functions are typically used to add elements to a plot.
• points(), lines(), text()

204
Demo of {base} plotting system

•Using the mtcars dataset

•Basic plots using


• hist(), barplot(), boxplot(),
plot()

•Annotation using
• points(), lines(), text()

205
Some Important Base Graphics
Parameters
• Many base plotting functions share a set of parameters. Here
are a few key ones:
• pch: the plotting symbol (default is open circle)
• lty: the line type (default is solid line), can be dashed,
dotted, etc.
• lwd: the line width, specified as an integer multiple
• col: the plotting color, specified as a number, string, or hex
code; the colors() function gives you a vector of colors by
name
• xlab: character string for the x-axis label
• ylab: character string for the y-axis label

206
col and pch
R colour chart:
http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.pdf

207
Line types

208
Some Important Base Graphics
Parameters
• The par() function is used to specify global graphics
parameters that affect all plots in an R session. These
parameters can be overridden when specified as arguments
to specific plotting functions.
• las : the orientation of the axis labels on the plot
• bg : the background color
• mar : the margin size
• oma : the outer margin size (default is 0 for all sides)
• mfrow : number of plots per row, column (plots are filled
row-wise)
• mfcol : number of plots per row, column (plots are filled
column-wise)
209
Plots

[Figure: four example plots: basic histogram, line graph with regression, basic boxplot, scatterplot]


Basic Histogram

library(datasets)
hist(warpbreaks$breaks,
     breaks=20,
     xlab="Breaks",
     main="Number of Breaks in Yarn during Weaving",
     ylim=c(0,20))

Labelled on the plot: main, breaks, xlab, ylab, xlim, ylim

211
Basic Line Plot
library("MASS")
data("cats")
plot(cats$Bwt, cats$Hwt,
     type="l",
     col="red",
     lwd=1,
     ylab="Heart weight (g)",
     xlab="Body weight (Kg)",
     main="Anatomical features of house cats")
fit1 <- lm(formula = cats$Hwt ~ cats$Bwt)
abline(fit1, lty="dashed")
#sample of text to be placed in plot
text(x=2.3, y=18, labels="R2=0.896\n P=2615e-15")

Labelled on the plot: text(), abline(fit1), type, lwd, col
212
Basic Scatterplot

library(datasets)
plot(iris$Sepal.Length, iris$Petal.Length,
     col=iris$Species,
     pch=16,
     cex=0.5,
     xlab="Sepal Length",
     ylab="Petal Length",
     main="Flower Characteristics in Iris")
legend(x=4.2, y=7, legend=levels(iris$Species), col=c(1:3), pch=16)

Labelled on the plot: legend(), col, pch, cex

213
Basic Boxplot
library(datasets)
boxplot(iris$Sepal.Length ~ iris$Species,
ylab="Sepal Length",
xlab="Species",
main="Sepal Length by Species in Iris")

214
{base} graphics system – par()
• Panel plotting
• 1 by 2
• 2 by 2
• 3 by 1

• Use par() with the mfrow parameter

e.g. par(mfrow=c(2,2))
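A minimal sketch of a 2 by 2 panel layout follows. It assumes the built-in mtcars dataset and writes to a temporary PDF file (an assumption made here so that the code also runs where no screen device is available).

```r
# Draw to a temporary PDF so the sketch also runs without a screen device
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Request a 2 by 2 grid filled row by row; par() returns the old settings
opar <- par(mfrow = c(2, 2))

hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by cylinders")
plot(mtcars$wt, mtcars$mpg, main = "mpg vs weight", xlab = "wt", ylab = "mpg")
barplot(table(mtcars$gear), main = "Cars per gear count")

grid_used <- par("mfrow")  # the layout currently in effect
par(opar)                  # restore the previous layout
dev.off()
```

For a 1 by 2 or 3 by 1 layout, only the vector passed to mfrow changes: c(1, 2) or c(3, 1).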

215
par() margins

e.g. par(mar=c(3,4,3,4), oma=c(3,4,3,4))
Exercise 1

20 minutes

217
Exercise 2a (plot disp ~ mpg)

10 minutes

218
Exercise 2b (plot disp ~ mpg)

40 minutes
• Main title
• Plotting symbols type, size
and colour
• X and y axes labels
• X and y axes limits
• Include an abline() line that is dashed, red and thicker

219
{base} graphics system – output

• Demonstration of the various common formats
• PDF – pdf()
• PNG – png()
• JPG – jpeg()
• Screen (on MS-Windows systems, it’s called windows())
• dev.copy()
• dev.copy2pdf()

220
{base} graphics system - output

> png("mtcars.png", 600, 600)
> plot(mtcars$disp, mtcars$mpg, xlab="Displacement (cu. in.)",
    ylab="Miles Per Gallon", pch=16, col="gray",
    main="MPG vs Displacement", ylim=c(10,35), xlim=c(85,500), bty="n")
> dev.off()
Or
> plot(mtcars$disp, mtcars$mpg, xlab="Displacement (cu. in.)",
    ylab="Miles Per Gallon", pch=16, col="gray",
    main="MPG vs Displacement", ylim=c(10,35), xlim=c(85,500), bty="n")
> dev.copy(png, "mtcars.png")
> dev.off()

221
Graphics Device
• Computer Screen •NOT
• Input (such as Mouse &
Keyboard)
• File System
• Network Connection
Bitmap vs Vector Graphics
• Bitmap
  • BMP
  • JPG / JPEG
  • TIFF
  • GIF
  • PNG
  • Better for plots with many points, such as scatter plots and density plots.
• Vector
  • PS
  • EPS
  • SVG
  • Good for resizing, scaled plots.
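As a counterpart to the PNG (bitmap) example on the earlier slide, the sketch below writes the same kind of plot to a vector device. The temporary file name is an assumption of this example; note that pdf() takes its size in inches rather than pixels.

```r
# Vector output: pdf() stores drawing instructions, so the plot scales cleanly
pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file, width = 6, height = 6)   # size in inches, not pixels

plot(mtcars$disp, mtcars$mpg, pch = 16, col = "gray",
     xlab = "Displacement (cu. in.)", ylab = "Miles Per Gallon",
     main = "MPG vs Displacement")

# The file is only complete once the device is closed
dev.off()

pdf_size <- file.info(pdf_file)$size
```

Nothing is written in full until dev.off() is called, which is the same rule as for png().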
Morning Break
Please be back by 11:00 am
1100 - 1200
INTRODUCTION TO LATTICE
PLOT
Lattice plot

• The lattice package improves on base R graphics by providing better defaults and the ability to display multivariate relationships.

• It is particularly good at creating trellis graphs: graphs that display a variable, or the relationship between variables, conditioned on one or more other variables.

226
Lattice plot format
• graph_type(formula, data=)

graph_type     description                 formula examples
barchart       bar chart                   x~A or A~x
bwplot         boxplot                     x~A or A~x
cloud          3D scatterplot              z~x*y|A
contourplot    3D contour plot             z~x*y
densityplot    kernel density plot         ~x|A*B
dotplot        dotplot                     ~x|A
histogram      histogram                   ~x
levelplot      3D level plot               z~y*x
parallelplot   parallel coordinates plot   data frame
splom          scatterplot matrix          data frame
stripplot      strip plots                 A~x or x~A
xyplot         scatterplot                 y~x|A
wireframe      3D wireframe graph          z~y*x
Try it yourself

# Lattice Examples
library(lattice)
attach(mtcars)

# create factors with value labels


gear.f<-factor(gear,levels=c(3,4,5),
labels=c("3gears","4gears","5gears"))
cyl.f <-factor(cyl,levels=c(4,6,8),
labels=c("4cyl","6cyl","8cyl"))

228
Try it yourself

# kernel density plot


densityplot(~mpg,
main="Density Plot",
xlab="Miles per Gallon")

# kernel density plots by factor level


densityplot(~mpg|cyl.f,
main="Density Plot by Number of Cylinders",
xlab="Miles per Gallon")

229
Try it yourself
# kernel density plots by factor level (alternate layout)
densityplot(~mpg|cyl.f,
main="Density Plot by Number of Cylinders",
xlab="Miles per Gallon",
layout=c(1,3))

# boxplots for each combination of two factors
bwplot(cyl.f~mpg|gear.f,
ylab="Cylinders", xlab="Miles per Gallon",
main="Mileage by Cylinders and Gears",
layout=c(1,3))

230
Customizing Lattice plots

• Lattice graphs are not affected by many of the options set in the par() function. To view the options that can be changed, look at help(xyplot).

231
Customizing Lattice plots

# Customized Lattice Example
library(lattice)
panel.smoother <- function(x, y) {
  panel.xyplot(x, y) # show points
  panel.loess(x, y) # show smoothed line
}
attach(mtcars)
hp <- cut(hp, 3) # divide horse power into three bands
xyplot(mpg~wt|hp, scales=list(cex=.8, col="red"),
  panel=panel.smoother,
  xlab="Weight", ylab="Miles per Gallon",
  main="MPG vs Weight by Horse Power")

232
Lunch Break
Please be back by 2:00 pm
1400 - 1500
INTRODUCTION TO GGPLOT2
{ggplot2} graphics system
• Created by Hadley Wickham
• Based on Leland Wilkinson's The Grammar of Graphics.
• Widely used, and often considered the best R package for static visualization.
• Package = ggplot2.
• The main function is ggplot().
{ggplot2} introduction

install.packages("ggplot2")
library(ggplot2)
head(mtcars)

236
{ggplot2} plot layers

library(ggplot2)
g <- ggplot(data = mtcars, aes(x = hp, y = mpg))
print(g)

> ggplot(data = mtcars, aes(x = hp, y = mpg))
Error: No layers in plot
• What is a geometry object (geom) in the ggplot2 system?
• A plotting object such as a point, line, or other shape

237
{ggplot2} Scatterplots
ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point()

238
Exercise 3: Adding colours to the plot

Modify the content of aes() to include colours and produce the following graph.

ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point()
239
Exercise 4: Adding legend label to the
plot

Include the appropriate parameters here (hint: vectors):

+ scale_color_discrete(labels = )
240
Exercise 5: Re-labelling

labs(color = "Transmission", ? ? )
241
{ggplot2} Scatterplots
• Add different themes to
your plots
• theme_bw()
• theme_light()
• theme_minimal()
• theme_classic()

• Which one of the above


produces the graph shown
here?
{ggplot2} Scatterplots

ggplot(data = mtcars, aes(x = hp, y = mpg, color = factor(am),
                          alpha = wt, size = cyl)) +
  geom_point() +
  scale_color_discrete(labels = c("Automatic", "Manual")) +
  labs(color = "Transmission", x = "Horsepower", y = "Miles Per Gallon",
       alpha = "Weight", size = "Cylinders")
{ggplot2} Bar Charts

> head(diamonds)

• Instead of geom_point() use geom_bar()


• For the next few exercises, use the ? (help) to assist you.

244
Exercise 6a: {ggplot2} Bar Charts
• Using the “diamonds” data and geom_bar()

• Count the frequency of the diamonds by “clarity”
Exercise 6b: {ggplot2} Bar Charts

• The bars can be “fill”ed accordingly.

• Use the diamond “cut” to fill them appropriately.
Exercise 6c: {ggplot2} Bar Charts
• Instead of stacking the “cut” types, place them side by side in the bar chart.

• geom_bar() has the position option “dodge”.
{ggplot2} Line Charts

• Typically used for visualizing how a continuous variable (on the y-axis) changes in relation to another continuous variable (on the x-axis).
• Often the x variable represents time, but it may also represent some other continuous quantity.

• Also appropriate when the x variable is ordered (e.g., “small”, “medium”, “large”), but not when it is unordered (e.g., “cow”, “goose”, “elephant”)

• In the ggplot2 system, the plot layer is geom_line()
248
{ggplot2} Line Charts

set.seed(1)
time <- 1:10
income <- runif(10, 1000, 5000)
dt <- data.frame(time, income)

249
{ggplot2} Line Charts
• Using the dataset created (previous slide)

• Plot the graph using geom_line() as the plot layer.

ggplot(dt, aes(x = time, y = income)) + geom_line()
{ggplot2} Line Charts
• Use both geom_line() and geom_point() on the data created to produce the graph.

ggplot(dt, aes(x = time, y = income)) +
  geom_line() + geom_point()
{ggplot2} Line Charts

• Separate the data into two categories. Let’s call them “high” and “low”

type <- c("low", "high", "low", "high", "low",
          "high", "low", "high", "low", "high")

• Then plot it to obtain the graph.

ggplot(dt, aes(x = time, y = income, colour = type)) +
  geom_line() + geom_point()
{ggplot2} geom_smooth()

set.seed(1)
time<-rep(1:5, each=2)
income <- runif(10,1000,5000)
dt <- data.frame(time,income,type)
ggplot(dt, aes(x = time, y = income, colour = type)) +
geom_smooth()
253
1500 - 1540
GGPLOT2 EXERCISE
Exercise 7: data("msleep")

255
Afternoon Break
Please be back by 4:00 pm
1600 - 1700
LABORATORY EXERCISE –
COLOURS AND PLOTS
Plotting and Colour in R

• A palette in R is simply a vector of colors. This vector can include hex triplets or R color names.
• All R plots come with default colours, which are in general “black” or “grey”.
• In the R base plotting system, colour is defined by the parameter “col”.
• Users can modify the colours to fit specific purposes.
• Plots with colours are usually generated layer by layer. For example, in a scatterplot with three colours (blue, green, red), you may start by plotting the blue dots, followed by the green dots, and lastly the red dots.
Plotting and Colour in R

•The grDevices package has two functions:


•colorRamp
•colorRampPalette
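The two grDevices functions differ in what they return, which a short sketch can make concrete. The colour choices below are arbitrary examples.

```r
# colorRamp() returns a function mapping numbers in [0, 1] to RGB values (0-255)
ramp <- colorRamp(c("red", "blue"))
rgb_ends <- ramp(c(0, 1))   # one row per input: red, green, blue components
rgb_mid  <- ramp(0.5)       # a blend halfway between red and blue
print(rgb_ends)
print(rgb_mid)

# colorRampPalette() returns a function that generates n hex colour strings
pal <- colorRampPalette(c("red", "yellow"))
five_cols <- pal(5)
print(five_cols)

# The generated vector can be passed wherever a col argument is accepted
# e.g. barplot(1:5, col = five_cols)
```

In short, colorRamp() is useful when you need numeric RGB values, while colorRampPalette() gives a palette that can be passed directly to col.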
Plotting and Colour in R

x <- rnorm(10000)
y <- rnorm(10000)
plot(x,y)
smoothScatter(x,y)
Experiment with Colours

•Setting a graphical parameter using par()

par() # view current settings
opar <- par(no.readonly=TRUE) # make a copy of the current settings that can be restored
par(col.lab="red") # red x and y labels
hist(mtcars$mpg) # create a plot with these new settings
par(opar) # restore original settings
Experiment with Colours

•You can also set the parameter inside the function

hist(mtcars$mpg, col.lab="red")
hist(mtcars$mpg, col.lab=colors()[552])
Experiment with Colours
Experiment with Colours

• Getting the colour names given the number:

colors()[c(552,254,26)]
[1] "red" "green" "blue"

grep("red",colors())
[1] 100 372 373 374 375 376 476 503 504 505 506 507 524 525 526
[16] 527 528 552 553 554 555 556 641 642 643 644 645
Experiment with Colours

colors()[grep("red",colors())]
[1] "darkred" "indianred" "indianred1" "indianred2"
[5] "indianred3" "indianred4" "mediumvioletred" "orangered"
[9] "orangered1" "orangered2" "orangered3" "orangered4"
[13] "palevioletred" "palevioletred1" "palevioletred2" "palevioletred3"
[17] "palevioletred4" "red" "red1" "red2"
[21] "red3" "red4" "violetred" "violetred1"
[25] "violetred2" "violetred3" "violetred4"
Experiment with Colours

colors()[grep("sky",colors())]
[1] "deepskyblue" "deepskyblue1" "deepskyblue2" "deepskyblue3"
[5] "deepskyblue4" "lightskyblue" "lightskyblue1" "lightskyblue2"
[9] "lightskyblue3" "lightskyblue4" "skyblue" "skyblue1"
[13] "skyblue2" "skyblue3" "skyblue4"
Experiment with Colours
with(airquality, plot(Wind,Ozone, type="n"))
with(subset(airquality,Month==5), points(Wind, Ozone, col="blue"))
with(subset(airquality,Month!=5), points(Wind, Ozone, col="red"))
legend("topright",pch=1,col=c("blue","red"),legend = c("may","other
months"))
Experiment with Colours

•Plot using colours from a palette function


> n=seq(1:12)
> barplot(n)
> barplot(n,col=rainbow(4))
Experiment with Colours
n=12
pie(rep(1,n), col=FALSE)
pie(rep(1,n), col=rainbow(5))
pie(rep(1,n), col=heat.colors(n))
pie(rep(1,n), col=terrain.colors(n))
pie(rep(1,n), col=topo.colors(n))
Experiment with Colours

•You can define your own colour for your plot

x <- rnorm(10000)
hist(x,col="blue")
mycolor<-c("blue","red","green")
hist(x,col=mycolor)
Experiment with Colours
•Define your own colour

heatcols <- heat.colors(6)


hist(x, col = heatcols)
Experiment with Colours

library(RColorBrewer)
display.brewer.all()
• RColorBrewer works a little differently from how we’ve defined palettes previously. We’ll have to use brewer.pal to create a palette.
Experiment with Colours

library(RColorBrewer)
darkcols <- brewer.pal(8, "Dark2")
hist(x, col = darkcols)
Exercise 7b: data("msleep")

274
BASIC DATA SCIENCE
WITH

Day 4

Universiti Sains Malaysia October 2016


Training Schedule: Day 4
DAY4
0900-1000 R, Maps, Markdown, Rpubs
1000-1040 R Markdown for documentation
1040-1100 Morning Break
1100-1200 R and predictive analytics using caret package
1200-1300 Laboratory exercise – caret package
1300-1400 Lunch
1400-1500 Introduction to Shiny
1500-1540 Laboratory exercise – Rmd, Rpubs with R Studio and Shiny
1540-1600 Afternoon Break
1600-1700 Wrap-up
R MARKDOWN
What is Markdown?

Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML).
John Gruber
Creator of Markdown

278
Markdown Syntax

• Headings
# This is Heading 1
## This is Heading 2
### This is Heading 3
• Italics
*This is italic*
• Bold
**This is Bold**

279
Markdown Syntax

Unordered Lists

- First item
- Second item
- Third item

280
Markdown Syntax

Ordered Lists

1. First item
2. Second item
3. Third item

281
Markdown Syntax

• Creating inline links and reference-style (footnote-like) links

[Poo Kuan Hoong’s web page](http://pesona.mmu.edu.my/~khpoo)

[Poo Kuan Hoong’s web page][1]
[1]: http://pesona.mmu.edu.my/~khpoo "my homepage"

282
Markdown Syntax

•Newlines
• End a line with two spaces to create a line break.

283
R Markdown

• An extension of Markdown that supports R code.
• Allows one to create documents containing “live” R code.
• The results of the R code are inserted into the Markdown document.
• R Markdown can be converted to standard Markdown using the knitr package in R.
• R Markdown → Markdown → HTML.
• R Markdown → slides, using the slidify package.

284
Knitr : Strengths vs Weaknesses
Strengths
• Text and code all in one place
• Results updated automatically
• Code is live!
Weaknesses
• Difficult to read as code and text are jumbled up
• Processing of code, particularly loading of large files, can take time
• No code, no document
Good for:
• Manuals
• Short-to-medium documents
• Tutorials
• Reports
• Data processing summaries
Not good for:
• Long research articles
• Documenting complex processes
• Documents with precise formatting
R Markdown: Hands-on

•Create a new R Markdown document.

286
R Markdown: Hands-on

• Clear everything in the newly created R Markdown document.
• Type in the following:

My first R Markdown document
============================

### loading the car dataset

In this section, I shall explain how to load the built-in dataset named **mtcars**.
287
R Markdown: Hands-on

•Click the button knit HTML

288
If you don’t like clicking on buttons…

In your R code, you can knit a document using the following commands:

library(knitr)
knit2html("document.Rmd")
browseURL("document.html")

289
knitr notes
• Code chunks begin with ```{r} and end with ```
• All R code goes in between these markers
• Code chunks can have names, which is useful when we start making graphics
```{r firstchunk}
## R code goes here
```
• By default, the code in a code chunk is echoed, as are the results of the computation (if there are results to print)
• Three things happen when you knit
• You write the R Markdown document (with a .Rmd extension)
• knitr produces a Markdown document (with a .md extension)
• knitr converts the Markdown document into HTML (by default)
• You should NOT edit (or even save) the .md or .html documents until you are finished; these documents will be overwritten the next time you knit the .Rmd file.
R Markdown: Hands-on
```{r}
library(datasets)
library(ggplot2)
str(mtcars)
```

plotting graph wt against mpg

```{r}
ggplot(mtcars, aes(wt, mpg)) +
geom_point(aes(colour=factor(cyl), size
= qsec))
```
R Markdown: Hands-on

• The chunk option echo=FALSE hides the R code so that it is not displayed in the output.

293
R Markdown: Hands-on

•Using the previous code, what happens if you


include echo=FALSE?

plotting graph wt against mpg

```{r echo=FALSE}
ggplot(mtcars, aes(wt, mpg)) +
geom_point(aes(colour=factor(cyl), size
= qsec))
```
294
R Markdown: Hands-on

• Using the previous code, what happens if you include echo=FALSE, message=FALSE, warning=FALSE, fig.keep='none', results='hide'?

plotting graph wt against mpg

```{r echo=FALSE, message=FALSE, warning=FALSE, fig.keep='none', results='hide'}
ggplot(mtcars, aes(wt, mpg)) +
geom_point(aes(colour=factor(cyl), size = qsec))
```

295
Converting R Markdown to a
presentation

•Although PDF files can be used in place of a presentation deck, they lack effects such as transitions and precise control. An R Markdown file can be converted to the following presentation formats:

• ioslides (a Google presentation format)
• Slidy (a web presentation format)
• Beamer (a LaTeX-based presentation format)
296
Converting R Markdown to a
presentation
• To convert an R Markdown document to the required format, you need to modify the output field in the YAML header

---
title: "Habits"
author: "John Doe"
date: "March 22, 2005"
output: ioslides_presentation
---

The output field can be ioslides_presentation, slidy_presentation or beamer_presentation.
297
Converting R Markdown to a
presentation
• Or you can choose the correct drop down option
from R Studio

298
Rpubs
•You may publish your output to RPubs, a free R Markdown hosting website by RStudio.

•Navigate to www.rpubs.com

299
Exercise

•Create a one page research project


website using Markdown and publish it to
Rpubs.

•Incorporate picture(s) and links

300
Morning Break
Please be back by 11:00 am
Introduction to the Caret package
What is Caret?
• A useful set of front-end tools / wrappers
• caret.r-forge.r-project.org
• Functionality
• Some preprocessing (cleaning): preProcess
• Data splitting: createDataPartition, createResample, createTimeSlices
• Training/testing functions: train, predict
• Model comparison: confusionMatrix
ML techniques supported by R

•Linear discriminant analysis


•Regression
•Naive Bayes
•Support vector machines
•Classification and regression trees
•Random forests
•Boosting
•And much more!
303
Why caret?

• R has many packages for predictive modeling (aka machine learning, aka pattern recognition) . . .
• Since there are many modeling packages written by different people, there are some inconsistencies in how models are specified and predictions are made.
• The caret package was developed to:
• create a unified interface for modeling and prediction (currently 147 different models)
• streamline model tuning using resampling
• provide a variety of “helper” functions and classes for day-to-day model building tasks
• increase computational efficiency using parallel processing
304
Using caret on the SPAM dataset
#load library, ensure e1071 is also installed
library(caret); library(kernlab);
#split data into training and test set
data(spam)
inTrain <- createDataPartition(y=spam$type, p=0.75, list=FALSE) #75% training
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]
dim(training) # 3451 rows and 58 columns; dim returns the dimensions of training
set.seed(32343) #use the same seed to get reproducible results when you try on your own
#fitting phase using a Generalized Linear Model
modelFit <- train(type ~ ., data=training, method="glm")
modelFit
#final model
modelFit$finalModel
#predictions on testing data (gives a vector of predictions)
predictions <- predict(modelFit, newdata=testing)
predictions
#evaluate prediction using error measures
confusionMatrix(predictions, testing$type)
Reminders on caret
Suppose a 2x2 table with notation:

              Reference
Predicted     Event     No Event
Event         A         B
No Event      C         D

The formulas used here are:
Sensitivity = A/(A+C)
Specificity = D/(B+D)
Prevalence = (A+C)/(A+B+C+D)
PPV = (Sensitivity * Prevalence)/((Sensitivity * Prevalence) + ((1-Specificity) * (1-Prevalence)))
NPV = (Specificity * (1-Prevalence))/(((1-Sensitivity) * Prevalence) + (Specificity * (1-Prevalence)))
Detection Rate = A/(A+B+C+D)
Detection Prevalence = (A+B)/(A+B+C+D)
Balanced Accuracy = (Sensitivity+Specificity)/2
For more than two classes, these results are calculated by comparing each factor level to the remaining levels (i.e. a "one versus all" approach).

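The formulas above can be checked by hand with a few lines of base R. The counts A, B, C and D below are made up for illustration; the variable names are this example's own, not output from confusionMatrix.

```r
# Hypothetical counts following the slide's notation
# (rows = predicted, columns = reference)
A <- 40  # predicted event,    reference event
B <- 10  # predicted event,    reference no event
C <- 5   # predicted no event, reference event
D <- 45  # predicted no event, reference no event
total <- A + B + C + D

sensitivity <- A / (A + C)
specificity <- D / (B + D)
prevalence  <- (A + C) / total

ppv <- (sensitivity * prevalence) /
  ((sensitivity * prevalence) + ((1 - specificity) * (1 - prevalence)))
npv <- (specificity * (1 - prevalence)) /
  (((1 - sensitivity) * prevalence) + (specificity * (1 - prevalence)))

detection_rate       <- A / total
detection_prevalence <- (A + B) / total
balanced_accuracy    <- (sensitivity + specificity) / 2

print(round(c(sensitivity = sensitivity, specificity = specificity,
              ppv = ppv, npv = npv,
              balanced_accuracy = balanced_accuracy), 3))
```

With the prevalence taken from the table itself, PPV reduces to A/(A+B) and NPV to D/(C+D), which is a quick sanity check on the longer formulas.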
306
Data Slicing

• When performing data slicing (especially for splitting into training and testing data), we have the following options:
• K-Fold (split the data into K subsets)
• Resampling (uses simple random sampling)
• Time-Slice (more for time series data)
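To make the K-Fold idea concrete, here is a minimal base R sketch. It is not caret's own fold-creation code: it assigns folds by plain random sampling on the built-in iris data, whereas caret's helpers also try to balance the outcome classes across folds. The seed and the choice of k are arbitrary.

```r
set.seed(123)            # arbitrary seed, for a reproducible illustration
n <- nrow(iris)          # 150 observations
k <- 5

# Assign each row to one of k folds at random, with equal fold sizes here
fold_id <- sample(rep(1:k, length.out = n))

# Use fold 1 as the test set and the remaining folds as the training set
testing  <- iris[fold_id == 1, ]
training <- iris[fold_id != 1, ]

print(table(fold_id))
print(dim(training))
print(dim(testing))
```

Repeating the last step for folds 2 to k gives the k train/test splits used in k-fold cross validation.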

307
Training Options
• The default settings for training any model, including those for the error/performance metrics, can be seen by using the args(train.default) command
• Metric options
• Continuous outcomes: RMSE = root mean squared error, Rsquared = R^2 from regression models
• Categorical outcomes: Accuracy = fraction correct, Kappa = measure of concordance
• The default depends on whether the outcome variable is a factor or not

308
Training options for Resampling

• For more control, one can use the args(trainControl) command.

Example for Resampling
• method: boot = bootstrapping, boot632 = bootstrapping with adjustment (adjusts for multiple samples being repeatedly subsampled, which reduces some bias), cv = cross validation, repeatedcv = repeated cross validation (sub-cv with different random draws), LOOCV = leave one out cross validation
• number: number of samples for bootstrapping/cross validation; number of subsamples to take
• repeats: number of times to repeat subsampling; if large, this can slow things down
• Another example of a training option for Resampling is to set the seeds, which is useful when performing parallel fits

309
Plotting Predictors

• It is often useful to plot predictors prior to performing Machine Learning
• ggplot2 and caret provide several plots which can give useful clues

Plots                  Usage
featurePlot (caret)    Shows scatter plots of predictors by class, to show correlation
qplot (ggplot2)        Shows scatter plots of predictors by class (using colour), to show correlation. Possible to add regression smoothers (to show regression). Also able to plot density plots
cut2 (Hmisc)           Converts numeric variables into intervals (factors), which can then be shown as boxplots (to show the distribution of the data)
table                  Tabular results of categorical data
Notes on plotting predictors

• Make plots only on the training set (do not use the test set for exploration!)
• Look for:
  • imbalance in outcomes / predictors
  • outliers
  • groups of points not explained by a predictor
  • skewed variables
Preprocessing

• Why do we need to preprocess?
• Strange distributions and skewed variables can cause problems for model-based predictors
• A histogram, the mean (a mean that is not in the centre indicates skew) and the standard deviation (watch out for highly inflated values!) can reveal skew in the data
How to standardize (preProcess function)

• Generally, we can standardize by subtracting the mean of a predictor (a column in your data set) and then dividing by the standard deviation
• This will give a mean of zero and a standard deviation of one
• This can be performed by using the preProcess function in the caret package
• The transformation must be applied to both the training and the testing dataset, using the mean and standard deviation estimated from the training set
• Preprocessing can also be incorporated in the model fitting stage, by passing it as the preProcess argument of train
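The standardization described above can be written out in base R as a short sketch. The mtcars columns, the seed and the 24/8 split are assumptions for illustration; caret's preProcess packages the same idea.

```r
set.seed(1)
idx <- sample(seq_len(nrow(mtcars)), size = 24)   # assumed 75/25 split of 32 rows
training <- mtcars[idx,  c("mpg", "hp", "wt")]
testing  <- mtcars[-idx, c("mpg", "hp", "wt")]

# Estimate the centering and scaling values on the training set only
train_mean <- colMeans(training)
train_sd   <- apply(training, 2, sd)

# Apply the same training-set values to both sets
train_std <- scale(training, center = train_mean, scale = train_sd)
test_std  <- scale(testing,  center = train_mean, scale = train_sd)
```

The standardized training columns have mean zero and standard deviation one. The test columns will only be close to that, because they are transformed with the training values.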
How to preprocess continuous data into normal data
• If the dataset comprises continuous data (data measurable on a scale), it can be transformed into approximately normal data (normally distributed data) using the Box-Cox transform. The Box-Cox transform requires strictly positive values.
• "BoxCox" can be entered as an argument for method in the preProcess function
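For reference, the Box-Cox transform itself is a one-line formula: (x^lambda - 1) / lambda, with log(x) used when lambda is zero. The sketch below is a hand-written version with lambda fixed by hand; the function name box_cox and the simulated data are assumptions for the example. A preprocessing routine would normally estimate lambda from the training data.

```r
# Box-Cox transform for a strictly positive vector x and a given lambda
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))                       # undefined for zero or negative values
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

set.seed(7)
x <- rexp(500, rate = 1)          # simulated right-skewed data
x_log  <- box_cox(x, lambda = 0)  # lambda = 0 is the log transform
x_sqrt <- box_cox(x, lambda = 0.5)
```

Comparing hist(x) with hist(x_sqrt) shows how the transform pulls in the long right tail.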

How to handle missing values

• Some algorithms cannot handle missing data
• Missing data can be imputed using the preProcess function
• knnImpute can be entered as an argument for method in the preProcess function
• Subsequently, once the missing data has been handled, remember to standardize!
• REMINDER: Replace first, then standardize
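To show the idea behind k-nearest-neighbour imputation, here is a deliberately simple base R sketch for a single numeric column. It is not caret's implementation: the helper name knn_impute_column, the choice of k = 3 and the artificially removed values are all assumptions, and the other columns are assumed to be complete.

```r
# Minimal k-nearest-neighbour imputation for one numeric column (illustrative only)
knn_impute_column <- function(data, target, k = 3) {
  predictors <- setdiff(names(data), target)
  # Standardize the predictors so that distances are comparable across columns
  z <- scale(as.matrix(data[, predictors, drop = FALSE]))
  missing  <- which(is.na(data[[target]]))
  complete <- which(!is.na(data[[target]]))
  for (i in missing) {
    # Euclidean distance from row i to every row with an observed target value
    d <- sqrt(rowSums(sweep(z[complete, , drop = FALSE], 2, z[i, ])^2))
    nearest <- complete[order(d)[seq_len(min(k, length(complete)))]]
    # Replace the missing value with the average of its nearest neighbours
    data[[target]][i] <- mean(data[[target]][nearest])
  }
  data
}

df <- mtcars[, c("mpg", "hp", "wt")]
df$mpg[c(3, 10)] <- NA                       # introduce missing values artificially
imputed <- knn_impute_column(df, target = "mpg", k = 3)
```

Observed values are left untouched; only the two missing entries are filled in.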

On pre-processing

• Training and test sets must be processed in the same way
• Test transformations will likely be imperfect, especially if the test / training sets were collected at different times
• Be careful when transforming factor variables!
Covariate creation

• Covariates (also known as predictors or features) are the variables you will use in your model to predict the outcome you care about
• 2 levels of covariate creation:
• Level 1: from raw data to covariates, e.g. from raw email text, get the average number of capital letters, the number of times 'you' appears, etc. These are features that describe the raw email
• Level 2: transforming tidy covariates, i.e. no additional feature extraction steps are needed
Level 1
• Level 1, raw data to covariates
• Depends heavily on application
• The balancing act is summarization vs information loss
• Examples:
  • text files: frequency of words, phrases (Google ngrams), capital letters
  • images: edges, corners, blobs, ridges (computer vision feature detection)
  • webpages: number and type of images, position of elements, colors, videos (A/B testing, aka randomized trials in statistics)
  • people: height, weight, hair color, sex, country of origin
• The more knowledge of the system you have, the better the job you will do
• When in doubt, err on the side of more features
• It can be automated, but use caution! Automated features may be important for the training set but may not generalize well to the test set
Level 2

• Level 2, tidy covariates to new covariates (functions on covariates)
• More necessary for some methods (regression, SVMs) than for others (classification trees)
• Only do on training set!
• Best approach is through exploratory analysis (plotting / tables)
• New covariates should be added to data frames
Additional steps for covariates

• Common covariates to add: dummy variables
  • Basic idea: convert factor variables to indicator variables
  • E.g. the variable jobclass takes the values industrial and information; change it into 2 variables, industrial and information, each with values 0 or 1
• Removing zero covariates
  • Some variables have no variability at all, e.g. here, sex is always male
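Both steps can be illustrated in base R. The small workers data frame below is invented for the example, with a two-level jobclass factor and a sex column that never varies. caret offers dummyVars() and nearZeroVar() for the same tasks.

```r
# Hypothetical data: jobclass is a factor, sex never varies
workers <- data.frame(
  jobclass = factor(c("industrial", "information", "industrial", "information")),
  sex      = factor(c("male", "male", "male", "male")),
  age      = c(25, 40, 31, 52)
)

# One indicator column per level of jobclass (the "- 1" drops the intercept)
dummies <- model.matrix(~ jobclass - 1, data = workers)

# Flag columns with a single unique value; they carry no information
zero_var <- sapply(workers, function(col) length(unique(col)) < 2)
workers_reduced <- workers[, !zero_var, drop = FALSE]
```

Each row of dummies contains exactly one 1, marking that worker's job class, and the constant sex column is dropped from workers_reduced.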

Level 1 and Level 2 Covariates discussion
• Level 1 feature creation (raw data to covariates):
  • Science is key; Google 'feature extraction for [data type]', e.g. images, voice, etc.
  • Err on the side of overcreation of features
  • In some applications (images, voices), automated feature creation is possible / necessary
• Level 2 feature creation (covariates to new covariates):
  • The preProcess function in caret will handle some preprocessing
  • Create new covariates if you think they will improve fit
  • Use exploratory analysis on the training set for creating them
  • Be careful about overfitting!
• Preprocessing with caret: see instructions
Preprocessing with PCA (principal components analysis)

• In a dataset, we might have multiple quantitative variables, some highly correlated with others
• With PCA, the basic ideas are:
  • We might not need every predictor (like here, num415 and num857, do not need both)
  • A weighted combination of predictors might be better
  • We should pick this combination to capture the 'most information' possible
• Benefits: reduced number of predictors, reduced noise due to averaging, reduced computation time
What of multivariate variables?

• Related problems:
• Multivariate variables X1, ... Xn so X1 = (X11, ... X1m)
• Find a new set of multivariate vars that are uncorrelated and
explain as much variance as possible (in example, use X and
throw out Y and original vars)
• If you put all the vars together in one matrix, find the best
matrix created with fewer variables (lower rank) that explains
the original data
• The first goal is statistical, and the 2nd is data compression

PCA/SVD

• Related solutions - PCA / SVD
• SVD: if X is a matrix with each variable in a column and each observation in a row, then the SVD is a 'matrix decomposition', X = UDV^T (i.e. 3 matrices), where the columns of U are orthogonal (left singular vectors), the columns of V are orthogonal (right singular vectors) and D is a diagonal matrix (singular values)
• PCA: the principal components are equal to the right singular vectors if you first scale the variables (subtract the mean, divide by the standard deviation)
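The relationship between PCA and SVD can be checked numerically in base R. This sketch uses the four numeric columns of the built-in iris data, which is a choice made for the example only.

```r
# Scale the four numeric iris columns (subtract mean, divide by sd)
x <- scale(as.matrix(iris[, 1:4]))

sv <- svd(x)                                   # x = U D V^T
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Loadings from prcomp match the right singular vectors up to sign
max_diff <- max(abs(abs(pc$rotation) - abs(sv$v)))

# Multiplying the three matrices back together recovers the scaled data
rebuild_error <- max(abs(sv$u %*% diag(sv$d) %*% t(sv$v) - x))

# Proportion of variance explained by each component
var_explained <- sv$d^2 / sum(sv$d^2)
```

Both max_diff and rebuild_error are zero up to floating-point error. The comparison uses absolute values because the sign of each singular vector is arbitrary.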

Principal components in R: prcomp

• The function prcomp can be used to show the principal components of a dataset
• The idea is to get the principal components that show the most separation
• With the caret package, PCA can be performed by setting the method argument to "pca" when calling the preProcess function
• Or it can be passed straight to the train function
Final ideas on PCA

• Most useful for linear-type models
• Can make it harder to interpret predictors
• Watch out for outliers! Explore first, and transform first (with logs / Box-Cox); plot predictors to identify problems
Prediction with regression

•Key ideas
•Fit a simple regression model
•Plug in new covariates and multiply by the
coefficients
•Useful when the linear model is nearly correct
•Pros: easy to implement and interpret
•Cons: often poor performance in nonlinear
settings

Regression in R

• Linear regression is performed by calling the lm function
• RMSE is used as the error measure for both the training and the testing set
• In caret, "lm" is used as the entry for the method argument of train
• Regression models with multiple covariates can be included
• Often useful in combination with other models
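A minimal sketch of prediction with simple regression in base R follows. The built-in faithful data, the seed and the 50/50 split are assumptions chosen for the example; it fits the model with lm, plugs a new covariate value into the coefficients, and reports RMSE on both sets.

```r
set.seed(333)
in_train <- sample(seq_len(nrow(faithful)), size = 136)  # assumed 50/50 split of 272 rows
train_faith <- faithful[in_train, ]
test_faith  <- faithful[-in_train, ]

# Fit eruption duration as a linear function of waiting time
fit <- lm(eruptions ~ waiting, data = train_faith)

# Prediction = intercept + slope * new waiting time
new_wait <- 80
by_hand    <- unname(coef(fit)[1] + coef(fit)[2] * new_wait)
by_predict <- unname(predict(fit, newdata = data.frame(waiting = new_wait)))

# RMSE on the training and test sets
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_train <- rmse(train_faith$eruptions, predict(fit, newdata = train_faith))
rmse_test  <- rmse(test_faith$eruptions,  predict(fit, newdata = test_faith))
```

The hand calculation and predict() agree, which is the "plug in new covariates and multiply by the coefficients" step. The test-set RMSE is the more honest estimate of error on new data.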
Predicting with Regression Using Multiple Covariates

• Also about exploring which predictors are the most important
• featurePlot is used to pick the most important predictors
• Fit a linear model; lm converts jobclass and education to indicator variables (it converts factors to indicators by default)
• Diagnostics for the chosen predictors can be performed by plotting the residuals of the final model versus the fitted values (here the colour plot reveals race as a potential predictor variable)
• Prediction diagnostics can be performed as a plot
• May also use all covariates, at the risk of overfitting
• Regression is often used together with other models
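The automatic conversion of factors to indicator variables can be seen in a small base R sketch. The wage data used on the slides is not reproduced here, so the built-in iris data stands in as an assumption, with Species playing the role of a factor such as jobclass and the whole data set standing in for a training set.

```r
# Species is a factor; lm() expands it into indicator columns automatically
fit_multi <- lm(Sepal.Length ~ Petal.Length + Species, data = iris)
coef_names <- names(coef(fit_multi))

# Fitted values and residuals are the raw material for a diagnostic plot
diagnostics <- data.frame(
  fitted   = fitted(fit_multi),
  residual = resid(fit_multi)
)
```

The coefficient names include one indicator per non-reference level of Species. Calling plot(diagnostics$fitted, diagnostics$residual) gives the residuals-versus-fitted diagnostic plot, and colouring the points by a variable left out of the model can reveal a missing predictor.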
Lunch Break
Please be back by 2:00 pm
1400 - 1500
SHINY
What is Shiny?
• Shiny is a platform for creating interactive R programs embedded into a web page.
• Suppose that you create a prediction algorithm. With Shiny, you can very easily create a web input form that calls R, and thus your prediction algorithm, and displays the results.
• Using Shiny, the time to create simple, yet powerful, web-based interactive data products in R is minimized.
• However, it lacks the flexibility of full-featured (and more complex) solutions.
• Shiny is made by the folks at RStudio.
Shiny showcase

https://www.rstudio.com/products/shiny/shiny-user-showcase/
The architecture of Shiny
• Every Shiny app is maintained by a computer running R

[Diagram: Server Instructions and User Interface (UI)]
shinyapps.io

• Shinyapps.io is a platform as a service (PaaS) for hosting Shiny web apps (applications).
• Before you get started with shinyapps.io, you will need:
  • An R development environment, such as the RStudio IDE
  • (for Windows users only) RTools for building packages
  • (for Mac users only) XCode Command Line Tools for building packages
  • (for Linux users only) GCC
  • The devtools R package (version 1.4 or later)
  • The latest version of the rsconnect R package
Installation of devtools and rsconnect
devtools
• Shinyapps.io uses the latest improvements to the devtools package. To
use shinyapps.io, you must update devtools to version 1.4 or later. To
install devtools from CRAN, run the code below. Then restart your R
session.
> install.packages('devtools')
rsconnect
• The rsconnect package deploys applications to the shinyapps.io service.
Currently, you need to install the rsconnect package from its
development page at Github. You can do this by running the R
command:
> devtools::install_github('rstudio/rsconnect')
After the rsconnect package has been installed, load it into your R session:
> library(rsconnect)
shinyapps package

• Installation of shinyapps package:
• To install directly from GitHub, run this:

devtools::install_github("rstudio/shinyapps")
Create a shinyapps.io account

• Go to shinyapps.io and click "Log In."
• If you do not have an account, create a new account or sign in using your Google or GitHub account
• The first time you sign in, shinyapps.io prompts you to set up your account. Shinyapps.io uses the account name as the domain name for all your apps.
• Once you set up your account in shinyapps.io, you can configure the rsconnect package to use your account.
Configure the rsconnect package

• Click the Show button on the token page.
• Copy the command to your clipboard, then paste it into the command line of RStudio and press Enter.
App template

library(shiny)

ui <- fluidPage()

server <- function(input, output) {}

shinyApp(ui = ui, server = server)
Input and Output

• Add elements to your app as arguments to fluidPage()

ui <- fluidPage(
  # *Input() functions,
  # *Output() functions
)
Input

Create an Input with an Input() function

sliderInput(inputId = "num", label = "Choose a number",
            value = 10, min = 1, max = 50)
Input syntax
sliderInput(inputId = "num", label = "Choose a number", …)

• inputId: input name (for internal usage)
• label: label to be displayed
• …: specific arguments
Output

Output syntax

• To display output, add it to fluidPage() with an *Output() function

plotOutput("hist")

• plotOutput: the type of output to display
• "hist": name to give to the output object
Server Function

• Use 3 rules to write the server function
  • Save objects to display to output$
  • Build objects to display with render*()
  • Access input values with input$
• Reactivity automatically occurs whenever you use an input value to render an output object
Server function

• Save objects to display to output$

server <- function(input, output) {
  output$hist <- # code
}

• output$hist in the server function corresponds to plotOutput("hist") in the ui
Server function

• Build objects to display with render*()

server <- function(input, output) {
  output$hist <- renderPlot({

  })
}
render() function
• Use the render*() function that creates the type of output you wish to make.

renderPlot({ hist(rnorm(100)) })

• renderPlot: type of object to build
• { hist(rnorm(100)) }: code block that builds the object
Server function

• Build objects to display with render*()

server <- function(input, output) {
  output$hist <- renderPlot({
    title <- "50 random normal values"
    hist(rnorm(50), main = title)
  })
}
Server function

• Access input values with input$

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$num))
  })
}

• sliderInput(inputId = "num", …) in the ui provides input$num in the server function
Input values
•The input value changes whenever a user
changes the input.

input$num = 10

input$num = 20

input$num = 35
Reactivity in R as a two-step process

1. Reactive values notify the functions that use them when they become invalid
   input$num

2. The objects created by the reactive functions respond
   renderPlot({ hist(rnorm(input$num)) })
Running and Deploying apps

• Make sure you have one directory with every file the app needs:
  • app.R (your script which ends with a call to shinyApp())
  • All your datasets, images, css, helper scripts, etc. in the same folder
• To run your shiny app, click on the Run App button
• Click on the publish button to submit your shiny app to shinyapps.io
Exercise 1: Create first shiny page
library(shiny)
ui <- fluidPage("Hello World!")
server <- function(input, output) {}
shinyApp(ui = ui, server = server)

Exercise 2: Customize your shiny page
library(shiny)
ui <- fluidPage(
  titlePanel("title panel"),
  sidebarLayout(
    sidebarPanel("sidebar panel"),
    mainPanel("main panel",
              h1("First level title"))
  )
)

server <- function(input, output) {}
shinyApp(ui = ui, server = server)
Exercise 3: Add Input and Output

library(shiny)
ui <- fluidPage(
  titlePanel("Page with Slider Input Demo"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(inputId = "num", label = "Choose a number",
                  value = 10, min = 1, max = 50)
    ),
    mainPanel("main panel",
              plotOutput("hist")
    )
  ))

server <- function(input, output) {}
shinyApp(ui = ui, server = server)

• *Output() adds a space in the ui for an R object.
• You must build the object in the server function
Exercise 4: Write the server function
library(shiny)
ui <- fluidPage(
  titlePanel("Page with Slider Input Demo"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(inputId = "num", label = "Choose a number",
                  value = 10, min = 1, max = 50)),
    mainPanel("main panel",
              plotOutput("hist")
    )
  ))
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$num))
  })
}
shinyApp(ui = ui, server = server)
The two files app: ui.R and server.R
Single-file app (app.R):

library(shiny)
ui <- fluidPage(
  titlePanel("Page with Slider Input Demo"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(inputId = "num", label = "Choose a number",
                  value = 10, min = 1, max = 50)),
    mainPanel("main panel",
              plotOutput("hist")
    )
  ))
server <- function(input, output) {
  output$hist <- renderPlot({hist(rnorm(input$num))
  })
}
shinyApp(ui = ui, server = server)

Two-file app:

# ui.R
library(shiny)
fluidPage(
  titlePanel("Page with Slider Input Demo"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(inputId = "num", label = "Choose a number",
                  value = 10, min = 1, max = 50)),
    mainPanel("main panel",
              plotOutput("hist")
    )
  ))

# server.R
library(shiny)
server <- function(input, output) {
  output$hist <- renderPlot({hist(rnorm(input$num))
  })
}
The two files app

• One directory with two files:
  • server.R
  • ui.R
• You MUST use the exact file names: server.R and ui.R
Exercise 5: Layout
library(shiny)
ui <- fluidPage(
  titlePanel("Page with Multi-tab Panel"),
  sidebarLayout(
    sidebarPanel(
      sliderInput(inputId = "num", label = "Choose a number",
                  value = 10, min = 1, max = 50)
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Plot", plotOutput("hist")),
        tabPanel("Summary", verbatimTextOutput("summary"))
      )
    )))
server <- function(input, output) {
  output$hist <- renderPlot({hist(rnorm(input$num))})
  output$summary <- renderPrint({summary(rnorm(input$num))})
}
shinyApp(ui = ui, server = server)
Afternoon Break
Please be back by 4:00 pm
1600-1700
FINAL EXERCISE AND
WRAP-UP
